A
The article addresses the problem of automatically measuring the interpretability of topic models. We propose several new intra-text approaches to estimating the interpretability of topics. Computational experiments are conducted on texts from “PostNauka”, a collection of popular science content.
In this paper, we explore the ways to improve POS-tagging using various
types of auxiliary losses and different word representations. As a baseline,
we utilized a BiLSTM tagger, which is able to achieve state-of-the-art results
on the sequence labelling tasks. We developed a new method for character-level
word representation using a feedforward neural network. Such representation
gave us better results in terms of speed and performance of the
model. We also applied a novel technique of pretraining such word representations
with existing word vectors. Finally, we designed a new variant
of auxiliary loss for sequence labelling tasks: an additional prediction of the
neighbour labels. Such loss forces a model to learn the dependencies inside
a sequence of labels and accelerates the process of training. We test
these methods on English and Russian languages.
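The neighbour-label auxiliary loss described above can be illustrated with a minimal sketch. It only shows how auxiliary targets could be derived from a gold label sequence and how the losses might be combined. The padding symbol, the function names and the weighting scheme are assumptions made for illustration, not details taken from the paper.

```python
PAD = "<pad>"  # hypothetical padding label for positions with no neighbour


def neighbour_targets(labels, pad=PAD):
    """Return (previous-label, next-label) target sequences for one sentence."""
    if not labels:
        return [], []
    prev_labels = [pad] + labels[:-1]   # label of the token to the left
    next_labels = labels[1:] + [pad]    # label of the token to the right
    return prev_labels, next_labels


def combined_loss(main_loss, prev_loss, next_loss, weight=0.5):
    """Add the auxiliary neighbour losses to the main tagging loss.

    The single scalar `weight` is an assumed hyperparameter.
    """
    return main_loss + weight * (prev_loss + next_loss)


tags = ["DET", "NOUN", "VERB"]
prev_t, next_t = neighbour_targets(tags)
```

In such a setup the tagger would have three output layers per token (current, previous and next label), each trained with its own loss against these targets.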
This paper discusses a method to detect statistically significant linguistic differences between corpora while factoring in possible variability within the very corpora to be compared. Specifically, we compare two small corpora of dialects of Even, Bystraja and Lamunkhin Even, in an attempt to identify
morphemes that are more frequent in either of the corpora. To investigate whether this difference might be due to an over-representation of a speaker
who happens to be an outlier in terms of using a particular morpheme,
we use DP, a measurement of evenness of the distribution of a specific linguistic
feature across subcorpora of the same corpus.
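As an illustration of the dispersion measure mentioned above, here is a minimal sketch. It assumes that DP is the "deviation of proportions" measure, computed as half the sum of absolute differences between each subcorpus's share of the feature and its share of the corpus. The function name and the toy counts are hypothetical.

```python
def deviation_of_proportions(feature_counts, subcorpus_sizes):
    """Compute DP for one feature.

    feature_counts[i]  : occurrences of the feature in subcorpus i
    subcorpus_sizes[i] : size (e.g. in tokens) of subcorpus i
    Values near 0 mean the feature is spread in proportion to subcorpus
    sizes; values near 1 mean it is concentrated in a small part of the corpus.
    """
    if len(feature_counts) != len(subcorpus_sizes):
        raise ValueError("one count and one size per subcorpus are required")
    total_feature = sum(feature_counts)
    total_size = sum(subcorpus_sizes)
    if total_feature == 0 or total_size == 0:
        raise ValueError("feature and corpus totals must be non-zero")
    return 0.5 * sum(
        abs(count / total_feature - size / total_size)
        for count, size in zip(feature_counts, subcorpus_sizes)
    )


# Two speakers with equally sized subcorpora (toy numbers).
even_use = deviation_of_proportions([10, 10], [100, 100])
one_speaker_only = deviation_of_proportions([20, 0], [100, 100])
```

With these toy numbers, a morpheme used equally by both speakers gets a DP of 0, while one used by a single speaker gets 0.5, the maximum possible for two equally sized subcorpora.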
The paper deals with a curious phenomenon of quasi-synonymy that occurs
in Russian between sentences with non-negated and negated predicates
in the construction with the adverb dolgo ‘for a long time’. Consider sentences
like Chainik dolgo zakipal ‘It took the kettle a long time to boil, lit. Kettle
for a long time boiled’ vs. Chainik dolgo ne zakipal ‘It took the kettle a long
time to boil, lit. Kettle for a long time not boiled’. The paper is an attempt
to define the semantic and pragmatic mechanisms of such quasi-synonymy, as well as semantic and aspectual classes of predicates where it occurs.
It also considers subtle semantic, pragmatic and communicative differences
associated with non-negated and negated construction, respectively.
Such quasi-synonymy occurs primarily in cases when the predicate belongs
to the aspectual class of accomplishments and denotes a telic process
or action with a desired result (‘to boil’, ‘to cool down’, ‘to warm up’,
‘to grow up’, ‘to finish’, etc.). Those predicates include two major semantic
components, that is, a lasting process or action and an instant result. In the
imperfective aspect they allow at least two possible interpretations, namely,
of a process and that of a result. Similar interpretations of sentences with
such predicates occur due to different scope assignments of negation and
dolgo. In sentences with non-negated predicate dolgo has scope over the
‘process’ component in the verb; in sentences with negated predicate negation
has scope over the ‘result’ component of the verb while at the same
time falling into the scope of dolgo. The former type of sentences describes
long-lasting processes, whereas the latter type describes long-awaited results,
which pragmatically amount to the same thing.
The paper is a corpus study of the factors involved in disambiguating potential
scope ambiguity in written sentences with negation and universal
quantifier all, such as I cannot visit all these universities, which, depending
on topic-focus assignment, can alternatively mean ‘I cannot visit any
of these universities’ (cannot is focus) and ‘I cannot visit some of these universities’
(all is focus). The factors at play in scope disambiguation are the
syntactic function of the constituent containing all (subject, direct complement,
adjunct); the status of the main predicate and all with respect to the
information structure of the utterance (topic vs. focus); veridical vs. nonveridical
context; sentence type (unreal conditional, rhetorical question);
and pragmatic implicatures pertaining to the situations described in the utterances.
The paper also demonstrates differences in the frequency distribution
of various scope readings and their underlying causes, as well as formulating
typical contexts for each scope interpretation.
The paper describes our participation in the first shared task on word sense
induction and disambiguation for the Russian language RUSSE’2018 (Panchenko
et al., 2018). For each of several dozen ambiguous words, the
participants were asked to group text fragments containing it according
to the senses of this word, which were not provided beforehand, hence
the “induction” part of the task. For instance, the word “bank” and a set of text
fragments (also known as “contexts”) in which this word occurs, e.g. “bank
is a financial institution that accepts deposits” and “river bank is a slope beside
a body of water”, were given. A participant was asked to cluster such
contexts into a number of clusters not known in advance, corresponding,
in this case, to the “company” and the “area” senses of the word “bank”. The
organizers proposed three evaluation datasets of varying complexity and
text genres based respectively on texts of Wikipedia, Web pages, and a dictionary
of the Russian language.
We present two experiments: a positive and a negative one, based respectively
on clustering of contexts represented as a weighted average
of word embeddings and on machine translation using two state-of-the-art
production neural machine translation systems. Our team showed the second
best result on two datasets and the third best result on the remaining
dataset among 18 participating teams. We managed to substantially
outperform competitive state-of-the-art baselines from the previous years
based on sense embeddings.
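The clustering experiment described above can be sketched as follows. This is a toy illustration only: the two-dimensional embeddings, the optional token weights and the greedy threshold clustering are all assumptions standing in for the real embedding model, weighting scheme and clustering algorithm, none of which are specified in the abstract.

```python
import math


def context_vector(tokens, embeddings, weights=None):
    """Weighted average of the embeddings of the known tokens in a context."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for token in tokens:
        if token not in embeddings:
            continue  # out-of-vocabulary tokens are skipped
        w = 1.0 if weights is None else weights.get(token, 1.0)
        for i, value in enumerate(embeddings[token]):
            total[i] += w * value
        weight_sum += w
    return [x / weight_sum for x in total] if weight_sum else total


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def greedy_cluster(vectors, threshold=0.7):
    """Assign each vector to the first cluster whose representative is similar
    enough; otherwise open a new cluster. The number of clusters is not fixed."""
    representatives, labels = [], []
    for vec in vectors:
        for idx, rep in enumerate(representatives):
            if cosine(vec, rep) >= threshold:
                labels.append(idx)
                break
        else:
            representatives.append(vec)
            labels.append(len(representatives) - 1)
    return labels


toy_embeddings = {
    "money": [1.0, 0.0], "deposit": [0.9, 0.1],
    "river": [0.0, 1.0], "water": [0.1, 0.9],
}
contexts = [["money", "deposit"], ["river", "water"], ["deposit", "money", "the"]]
sense_labels = greedy_cluster([context_vector(c, toy_embeddings) for c in contexts])
```

On this toy data the two financial contexts end up in one cluster and the river context in another, without the number of senses being given in advance.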
Morphological segmentation is an important task of natural language processing
as it can significantly improve the processing of unfamiliar and
rare words in different tasks that involve text data. In this paper we present
datasets in English and Russian for learning and evaluating morphological
segmentation algorithms, present a method based on a sequence-to-sequence
neural model, and show that the proposed approach achieves
better results than other existing methods of morpheme segmentation.
We start from an English dataset, which is already available and
required only minor preprocessing, and then we experiment with
Russian, for which we could not obtain prepared data, so more
substantial preprocessing was needed. Moreover, we demonstrate how
morphological segmentation can improve another natural language processing
task: evaluation of word semantic similarity. To achieve this goal,
we first try to reproduce the best results of the participants of the Russian word
semantic similarity competition (RUSSE), which was held at the Dialogue
2015 conference. Then we show how morpheme segmentation
can improve these results.
B
Framework for Russian plagiarism detection using sentence embedding similarity and negative sampling
In this paper, we propose a new approach to advanced plagiarism detection
in Russian. It is based on a classifier that uses two different
types of sentence similarity measures: token set similarity and cosine similarity
between sentence embeddings (based on pre-trained RusVectōrēs,
unsupervised fastText, and supervised StarSpace models). The diversity
of feature space makes it possible to detect different types of plagiarism,
starting from simple copy&paste cases and ending with complex manual
paraphrases. The proposed approach can focus on identifying
a particular plagiarism type while still allowing a universal model to be trained.
The method performs well on detection of different
types of plagiarism and outperforms the previous approach.
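The two kinds of similarity features mentioned in this abstract can be sketched in a few lines. The sketch assumes that token set similarity is the Jaccard overlap of whitespace-separated lowercase tokens and that sentence embeddings are already available as plain lists of floats. The function names are hypothetical, and the real system may tokenize and combine features differently.

```python
import math


def token_set_similarity(sentence_a, sentence_b):
    """Jaccard overlap between the token sets of two sentences (an assumption)."""
    set_a = set(sentence_a.lower().split())
    set_b = set(sentence_b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def cosine_similarity(u, v):
    """Cosine similarity between two sentence embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def pair_features(sentence_a, sentence_b, embedding_a, embedding_b):
    """Feature vector for a sentence pair, to be passed to a classifier."""
    return [
        token_set_similarity(sentence_a, sentence_b),
        cosine_similarity(embedding_a, embedding_b),
    ]


features = pair_features("the cat sat", "the cat ran", [1.0, 0.0], [1.0, 0.0])
```

A copy-and-paste case would score high on both features, while a manual paraphrase would tend to score low on token overlap but may still score high on embedding similarity, which is why the two feature types complement each other.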
Generic topics of large-scale document collections can often be divided into
more specific subtopics. Topic hierarchies provide a model for such topic
relation structure. These models can be especially useful for exploratory
search systems. Various approaches to building hierarchical topic models
have been proposed so far. However, there is no agreement on a standard
approach, largely due to the lack of quality metrics to compare existing
models. To bridge this gap we propose automated evaluation metrics which
measure the quality of topic-subtopic relations (edges) of a topic hierarchy.
We compare automated evaluations with human assessment to validate the
proposed metrics. Finally, we show how the proposed metrics can be used
to control and to improve the quality of existing hierarchical models.
The paper describes a new version of the semantic analyzer SemETAP.
Our approach is based on the assumption that the depth of understanding
grows with the number of inferences we can draw from the text.
The salient features of SemETAP include: 1) intensive use of both linguistic
and background knowledge. The former is incorporated in the Combinatorial
Dictionary and the Grammar, and the latter is stored in the Ontology
and Repository of Individuals. 2) Words and concepts of the ontology may
be supplied with explicit decompositions for inference purposes. 3) Two
levels of semantic structure are distinguished. Basic semantic structure
(BSemS) interprets the text in terms of ontological elements. Enhanced
semantic structure (EnSemS) extends BSemS by means of a series of inferences.
4) A new logical formalism Etalog is developed in which all inference
rules are written. Semantic analysis with inference allows us to extract
implicit information. The analyzer is tested on the task of interpreting
highlights of a football match.
A subject index, or back-of-the-book index, is a device intended to provide
easy access to relevant fragments of a text document. Subject indexes
usually contain particular single-word and multi-word terms from the corresponding
documents. Such indexes are especially useful for reading
large documents with specialized terminology, as well as educational texts
in difficult scientific and technical areas. The central problem of back-of-the-book
indexing is recognition of terms to be included in the index.
The paper describes a method developed for extracting and filtering terms
from a given educational scientific text, with the purpose of reliable term
selection in computer indexing systems. The method is primarily based
on rules with lexico-syntactic patterns representing linguistic information
about terms and typical contexts of their usage in Russian scientific and
educational texts; simple term occurrence statistics are used as well.
Experimental evaluation of the method has shown a considerable increase
of precision and recall of term extraction compared with the widely-used
standard techniques.
This paper addresses the task of automatic genre classification for Arabic
within the Functional Text Dimensions framework, which allows texts to get
a reliable genre description, while maintaining an adequate amount of genre
labels. Our aim in this study is to build an automatic classification model that
can annotate any Web text in Standard Arabic in terms of genres. To build
the training corpus we translated English and Russian annotated texts into
Arabic using Google MT. To build the model, we experimented with various
machine learning approaches, such as Logistic Regression, SVM, LSTM,
and different features, such as words, character n-grams and embedding
vectors. For testing the classification models, we collected and annotated
in terms of FTDs our own corpus of Arabic Web texts. The best performing
model offers reasonable classification accuracy in spite of being based
on a training corpus produced by MT.
D
The notion of event boundaries is closely connected with the category of aspect.
Aspectual forms show different views of “internal temporal constituency
of a situation” (Comrie 1976:3) and, consequently, construe events
in different ways. Recently scholars have started looking into the core of the
aspectual distinction through multimodality, considering hand gestures.
On the basis of Russian and French oral narratives produced by native
speakers, we conducted a study, testing our hypothesis about the existence
of direct correlation between the expression of boundaries in verbs and
in gestures. Means of boundary expression regarded for Russian on the verbal
level were perfective (soveršennyj vid) and imperfective (nesoveršennyj
vid) verbs, and for French—passé composé and imparfait. On the kinesthetic
level we distinguished between bounded gestures (i.e., involving
a pulse of movement) and unbounded gestures (i.e., smooth by nature).
While for French L1 we found a direct correlation between gesture boundary
schemas and aspectual forms, the results for Russian L1 did not support
our hypothesis. With a view to these differences between the two languages,
we studied the boundedness correlation in oral narratives produced by Russians
speaking French as L2 (CEFR levels B2-C1). The comparison between
L1 and L2 narratives revealed a certain change of gestural patterns: the Russian
speakers of French L2 used almost the same number of unbounded
and bounded gestures with the perfective verb forms and more unbounded
gestures with the imperfective forms, thus moving closer towards French
L1 speakers’ verb-gesture patterns. The use of gestures can be accounted
for by a series of noise factors related to language peculiarities, the cognitive
mechanism of profiling and challenges of speaking in L2.
The paper outlines the principles of a contrastive corpus study of German
and Russian modal constructions. The aim is, first, to refine the inventory
of meanings of German modal verbs and the conditions under which they
are realized, and second, to identify and describe the means of expressing
modal meanings in Russian by analyzing the set of constructions that serve
as functional equivalents when constructions with German modal verbs are
translated into Russian. We propose to carry out the analysis by building,
on the basis of a representative body of parallel German-Russian texts from
the Russian National Corpus (RNC), a supracorpora database of translation
correspondences in which both the German construction with a modal verb
and its Russian translation equivalent are annotated with a set of values
of relevant features. On the one hand, such a database will be a valuable
linguistic resource that can be used, among other things, to create a new
generation of interactive electronic German-Russian and Russian-German
dictionaries; on the other hand, the inventory of Russian construction types
with a (potential) modal meaning built on this database will be an important
contribution to the construction grammar of Russian, confirming the
fundamental continuity between lexicon and grammar.
E
The discourse marker tipa became widespread in colloquial Russian in the
1990s–2000s. However, until recently, it has received little attention.
In this paper we use the data from the Russian National Corpus and we aim
to accomplish the following goals: 1) to highlight the origin of the discourse
marker tipa from the noun tip ‘type’, 2) to describe the semantics of the discourse
marker tipa as well as that of the partly grammaticalized element tipa
as part of parametric constructions. We base our approach mainly on the
results achieved by Susanne Fleischman and Marina Yaguello.
F
The problem of spelling correction is crucial for search engines as misspellings
have a negative effect on their performance. It gets even harder when
search queries are related to a specific area not quite covered by standard
spell checkers, such as geographic information systems (GIS). Moreover,
standard spell-checkers are interactive, i.e. they can notice a misspelled
word and suggest candidate corrections, but picking one of them is up to the
user. This is why we decided to develop a spelling correction unit for 2GIS,
a cartographic search company. To do this, we have extracted and manually
annotated a corpus of GIS lookup queries, trained a language model,
performed various experiments to find the best feature extractor, then fitted
a logistic regression using an approach suggested in SpellRuEval, and
then used it iteratively to get a better result. We have then measured the
resulting performance by means of cross-validation, compared it against
two baseline algorithms and observed a substantial increase. We also present
an interpretation of the result achieved by calculating and discussing
the importance of specific features and analyzing the output of the model.
G
The problem of detecting heated arguments in text such as political debates
and customer complaints is formulated as tree kernel learning of discourse
structures. Affective argumentation structure is discovered in the form
of discourse trees extended with edge labels for communicative actions.
Extracted argumentation structures are then encoded as defeasible logic
programs and are subject to dialectical analysis, to establish the validity
of the main claim being communicated. We evaluate the accuracy of each
step of this affect processing pipeline as well as overall performance.
The paper examines dependencies between the syntactic and prosodic
structure with particular attention to the pausation and different levels
of prosodic boundary strength. The research is based on the prosodic data
markup for a spoken Russian text and the manual tagging of this text with
the relevant syntactic constituent boundaries. Two types of structures, the
finite clause and the asyndetic coordination, exhibit a strong positive correlation
with the appearance of a pause and the perceptual prosodic boundary.
We also demonstrate the presence of a substantial correlation between
the syntactic embedding depth and prosodic boundaries. The results of our
research show a significant connection between some of the initially proposed
syntactic factors and prosodic structure. We thus anticipate that
prosodic modules of TTS systems can benefit from taking certain syntactic
information into consideration.
I
The article intends to describe the formal variation of the connectors of the
Russian language on the basis of a cognitive-semantic approach. Every
discourse variant DV of a connector K, i.e. the specific form assumed
by K in a discourse section, is singled out and registered in the supracorpora
database of connectors (SCDB), in which a system of intersecting
clusters has been developed, so that in the course of annotation
the same DV can be assigned to different structural clusters. In the next phase,
on the basis of further semantic analysis, the DVs with a common element
are combined into a structural-semantic complex around a basic form: the
minimal linguistic unit that enables the speaker to express a certain logical-semantic
relation, and the listener to identify it. In conclusion, criteria
for describing the formal variation of the connectors are proposed, as well
as examples of the “profiles” of the basic forms. They reflect the potential
of linguistic means that the speaker has at his disposal to express one or another
logical-semantic relation or a combination of them.
The paper describes the Russian connective khotya (‘although’) from
a contrastive perspective. First, it focuses on the semantic description
of the connective and proposes to differentiate its four meanings, namely,
concessive propositional, concessive illocutionary, adversative propositional
and adversative illocutionary. The paper analyzes the functioning
of the connective khotya (prototypical marker of concessive relations) and
that of the connective no (‘but’, prototypical marker of adversative relations).
In so doing, it comes to the following conclusion: the adversative
meaning of khotya develops on the basis of its concessive meaning as the
connection between the situations presented in the textual fragments that
are linked by the connective becomes less logical. Conversely,
as the logical connection between situations becomes stronger, a concessive
interpretation arises in utterances with no. Further, the paper
takes a closer look at the French equivalents of khotya
in each of its four meanings. The concluding section attempts to define the
degree of language-specificity of khotya. To this end, several parameters are considered: (1) cases where the connective has a zero equivalent, (2)
cases of divergent translation (the connective is translated by a non-connective),
(3) number of translation patterns. To perform a contrastive analysis
and to collect statistical data, the supracorpora database of connectives
is used. The database is built upon the parallel Russian-French and French-Russian
subcorpora of the RNC.
The paper continues a series of studies of Russian microsyntax that the
author has been carrying out for quite a long time. It focuses on the
adverbial microsyntactic unit to i delo (то и дело, ‘every now and then’),
which is of considerable interest and is instructive because it combines
a number of implicit semantic features with a unique set of syntactic
properties, some of which come to light only when diachronic as well
as synchronic language data are considered. The unit is studied against
the background of other microsyntactic elements that are its neighbours
in the dictionary but have a substantially different set of linguistically
relevant properties. We discuss issues of adequately representing
phraseological units such as to i delo in the Microsyntactic Dictionary
of Russian, which is being compiled by the author and his colleagues,
and in a text corpus with microsyntactic annotation.
This paper addresses the problem of readability assessment for Russian
texts and investigates the impact of 24 lexical, syntactic and frequency features.
The research was conducted on the Russian Readability Corpus, which contains
two sub-corpora: two sets of grade 5–11 textbooks on Social studies
for native speakers of Russian. The sub-corpora were collected for research
purposes, annotated and marked as BOG and NIK. The application of the
Ridge regression has demonstrated the connection between readability and
average sentence length, average number of coordinating chains, average
number of sub-trees, frequency and lexical features. The results of the study
have the potential to be applied in a wide variety of areas, primarily
education, but also webpage design and document management.
K
This paper presents corpus-based research of quotation constructions
in Russian Sign Language (RSL). Quotation constructions have been observed
from different perspectives in different signed and spoken languages
[Brendel, Meibauer, Steinbach 2011]; [Litvinenko et al. 2009]. Based on the
corpus of spontaneous narratives recorded from RSL signers [Burkova
2015], we conducted a quantitative analysis of these constructions. We analyzed
constituents of quotation construction, such as the source (author
of utterance) indication, the introducing matrix predicate, and the quote.
Our investigation of non-manual markers in the corpus revealed that non-manual
marking of quotation is optional for RSL quotations. We distinguished
direct and indirect quotations in our data based on the reference
of indexical elements, the use of subordinating conjunction, and the imperative
mood. We found that in RSL non-manuals do not mark the direct/
indirect type of quotation. Our data show that RSL signers tend to use direct
quotation much more frequently than indirect quotation. In addition,
we compared our findings with the data on quotation constructions in some
other sign languages and with the studies of quotation in natural discourse
of spoken languages. This comparison showed that RSL quotations share
core properties with quotations in spoken and signed languages [Litvinenko
et al. 2009].
Although language production and comprehension are parts of one and
the same linguistic capacity, they have been studied separately for a long
time. A key issue in the present day research is how the two processes are
related, and whether transitions from thought to language and vice versa
are accomplished by a single or two separate systems. Important progress
in this area has been achieved in the field of psycho- and neurolinguistics;
a brief review is provided in Section 1. In this paper we explore the production—comprehension
relationship on the basis of our multichannel
resource “Russian Pear Chats and Stories”. In Section 2 we describe this
resource, including the stimulus material, data collection setup, participants
and corpus size, and technical aspects. Section 3 lays out two main
theoretical notions: a model of face-to-face multichannel communication
and a scheme of the production-comprehension interweaving in each interlocutor.
In subsequent sections we discuss three case studies of production—comprehension
relationships: relative contributions of kinetic
channels to discourse understanding (Section 4), turn-taking and eye gaze
(Section 5), and multichannel continuity (Section 6). The evidence of the
multichannel corpus suggests a cognitive architecture that integrates language
production and comprehension.
In the paper we discuss methods used to create CoSyCo, a corpus of syntactic
co-occurrences, which provides information on syntactically related
words in Russian. We describe a list of shallow parsing templates, which
were used to collect data for CoSyCo. The paper includes an overview of the
corpora collected for CoSyCo creation and an outline of how the noun ‘virus’
is used in its subcorpora as an example of the information which can
be obtained from this online resource.
Word-vector representations have been extensively studied for rich resource
languages with large text datasets. However, only a few studies analyze
semantic representations of low resource languages, where only a small
corpus is available. In this study we introduce a methodology and compare
techniques to learn semantic representations of low resource languages.
The proposed methodology consists of defining accurate preprocessing
steps, applying a language-independent stemmer and learning word-vector
representations. In addition, we propose a simple word embeddings evaluation
scheme that can be easily adapted to any language. By using this
methodology we learn word-vector representations for the Buryat language.
In order to promote further research we make the source code and the resulting
word embeddings corpus publicly available.
Topic—focus articulation in Russian has been mainly studied against isolated
utterances. In a categorical sentence, this communicative opposition
is reflected in the linear-accentual structure [Paducheva 2015]. For a simple
declarative sentence, that would normally mean that the topic (theme)
comes first and has a rising phrasal accent, while the focus (rheme) completes
the utterance and is pronounced with a falling accent. At the same
time, these formal features do more than just differentiate between topics
and foci; they also mark the discourse-semantic category of phase [Kodzasov
2009]. In syntactically simple utterances, topics tend to correlate with
anticipated continuation, hence non-final phase; foci are usually phase-final.
As I intend to show in this paper, the non-final phase provides a variety
of contexts that challenge the topic—focus distinction. The study is based
on the “Stories about presents and skiing”—a collection of prosodically annotated
spoken narratives.
In Section 1, I concentrate on issues within a simple clause, where
non-final verbal elements often have a fuzzy communicative interpretation.
In Section 2, I analyze complex syntactic structures. The data show that
non-final clauses may demonstrate both thematic and rhematic properties
with regard to their intonation patterns, internal structure and discourse
function. Hence, one can claim that some non-final clauses are topics, while others are foci. However, a majority of non-final clauses in the analyzed
corpus may not be unambiguously attributed to either of these categories.
Section 3 provides a pilot study of complex intonation patterns. Only
phase distinction being considered, utterances with more than one accentual
phrase may follow either (i) the basic adaptation strategy (comprising
a non-final rising accent and a final falling accent), or, more often, (ii) a complicated
strategy: (a) multiple parallel adaptation, (b) consecutive adaptation,
or (c) parenthetical strategy.
Our project aims to design a syntactic parser, which constructs a semantic
representation in a frame format: a clause is represented as a table of valencies,
filled in with semantic markers. This representation is compared to a list
of scripts—used to disambiguate and classify the semantic representation
as well as to select an appropriate reaction for a companion robot F-2.
The paper discusses the most important results of the project “Hierarchy
of prosodic phrasing in spoken language: controlling factors and means
of realization”. The project was aimed at expanding the empirical base
of research on phrasal prosody, whose inadequacy is noted in many scientific
areas: discourse theory, syntax, intonational phonology, general
phonetics, speech synthesis and recognition, etc. The introduction provides
a brief description of the study background and formulates the tasks that
had to be solved to reach the ultimate goal of the project, which was planned
to run for 3 years. The first section describes the characteristics
of the speech corpora created in the project to build a complex
linguistic-prosodic database required for the study and modeling of prosodic
phrasing in Russian speech, which takes into account, as far as possible,
all controlling factors and means of realization. The second section is devoted
to the description of the structure and composition of the database of discursive
features of word breaks (BDF), obtained on the basis of annotated,
prosodically graduated and acoustically analyzed speech corpora. The format
and structure of the database as a computer resource are universal and flexible,
so its feature set can be freely extended and the parametric characteristics
of the features can be specified in more detail. The third section illustrates
the application of the BDF to theoretical and statistical modelling of inter-level
correlations between syntax and linguistic prosody, and between linguistic
prosody and the speech signal (acoustic speech), in both directions. The conclusion
summarizes the results of research and discusses some promising
directions for further studies on relevant topics.
The paper examines metatextual (parenthetical) constructions with mental
verbs in the 2nd person. It is shown that while the propositions associated
with 1st-person parentheticals (думаю ‘I think’; боюсь ‘I am afraid’;
знаю ‘I know’, etc.) and 3rd-person parentheticals (считают ‘they believe’,
etc.) belong to the speaker and to a third party respectively, the propositions
associated with 2nd-person parentheticals (думаешь ‘you think’,
представляешь ‘you imagine’, знаешь ‘you know’, etc.) usually do not
belong to the addressee. The following questions are considered: whether
there is a semantic correlation between the proposition and the metatextual
construction (MC), and what illocutionary functions the MC and the
proposition have. It is shown that some MCs are used only in interrogative
sentences.
The paper reports our participation in the shared task on word sense induction
and disambiguation for the Russian language (RUSSE’2018). Our team
was ranked 2nd for the wiki-wiki dataset (containing mostly homonyms) and
5th for the bts-rnc and active-dict datasets (containing mostly polysemous
words) among all 19 participants.
The method we employed was extremely naive. It implied representing
contexts of ambiguous words as averaged word embedding vectors, using
off-the-shelf pre-trained distributional models. Then, these vector representations
were clustered with mainstream clustering techniques, thus producing
the groups corresponding to the ambiguous words’ senses. As a side result,
we show that word embedding models trained on small but balanced
corpora can be superior to those trained on large but noisy data—not only
in intrinsic evaluation, but also in downstream tasks like word sense induction.
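The pipeline described above (average the word vectors of each context, then cluster the context vectors) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the two-dimensional `TOY_VECTORS` table stands in for a pre-trained distributional model, and the greedy threshold clustering in `cluster_contexts` is a hypothetical stand-in for the mainstream clustering techniques mentioned in the abstract.

```python
import math

# Toy 2-d "embeddings" (an assumption for illustration); a real system
# would load an off-the-shelf pre-trained model instead.
TOY_VECTORS = {
    "money": (1.0, 0.0), "deposit": (0.9, 0.1), "loan": (0.95, 0.05),
    "river": (0.0, 1.0), "water": (0.1, 0.9), "shore": (0.05, 0.95),
}

def context_vector(tokens, vectors):
    """Average the vectors of the known tokens in one context."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("no known tokens in context")
    dim = len(known[0])
    return tuple(sum(v[i] for v in known) / len(known) for i in range(dim))

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def cluster_contexts(contexts, vectors, threshold=0.5):
    """Greedy clustering: join the first cluster whose seed is similar enough,
    otherwise start a new cluster (one cluster per induced sense)."""
    seeds, labels = [], []
    for tokens in contexts:
        vec = context_vector(tokens, vectors)
        for idx, seed in enumerate(seeds):
            if cosine(vec, seed) >= threshold:
                labels.append(idx)
                break
        else:
            seeds.append(vec)
            labels.append(len(seeds) - 1)
    return labels

# Four contexts of an ambiguous word such as "bank"
contexts = [
    ["money", "deposit"], ["river", "water"],
    ["loan", "money"], ["shore", "river"],
]
labels = cluster_contexts(contexts, TOY_VECTORS)
```

Here the two financial contexts receive one label and the two river contexts another; with real embeddings the threshold (or the number of clusters) would have to be tuned on development data.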
L
This paper outlines the construction of a readability assessment system
for the purposes of Russian language learning. The system
is designed to help educators easily obtain information about the difficulty
level of reading materials. The estimation task is posed here as a regression
problem on a data set of 600 texts and a range of lexico-semantic
and morphological features. The scale choice and annotated text collection
issues are also discussed. Finally, we present the results of the experiment
with learners of Russian as a foreign language to evaluate the quality
of a predictive model.
Many words that according to the dictionaries have just one meaning are
in fact understood in different ways by different speakers. In this article
we deal with Russian nouns denoting everyday life objects which are subject
to much variation by age, gender, and region and are poorly described by the
existing dictionaries. We report the results of a multilevel survey, propose
some possible metrics of word knowledge and show to what extent the words
we studied are known among a certain population. We also claim that different
speakers possess different sets of meanings for each word, propose ways
to discover the distribution patterns for these sets and introduce the notion
of disperse polysemy. We believe that our findings may be useful in lexicography
(providing detailed information on current word usage in different social
groups), lexical semantics (researching meaning shifts and patterns of its
distribution among speakers), and language testing (more precise detection
of the vocabulary sizes both in native speakers and in language learners).
The paper deals with the Russian aby as a marker of “free choice” (or, rather,
not specified choice criteria) within indefinite pronouns against the background
of other markers of “free choice” such as ugodno, popalo, pridetsia.
It pays attention not only to the synchronic semantics of aby, but also to its
history and claims that the modern meaning of aby is related to its usage
as a conjunction. The paper makes use of the corpus data (the Russian National
Corpus as well as the Internet data) to follow the changes in the use
of the particle in question over the last two hundred years. It investigates
the range of K-words that can collocate with aby: the most typical are
collocations with kto, chto, kak and kakoi; however, collocations with other
K-words are also present in the corpora. In addition, it discusses the question
of negative polarity of aby and the increasing degree of its polarization.
The paper deals with the Russian interjections (oj, oh, aj, ogo, uh, etc.),
namely their non-canonical use in collocations with K-words (Wh-words),
mostly kak and kakoj. This type of use demonstrates a sort of syntactic recomposition
— collocations oj kak, oh kakoj, etc. function as lexical units
with the meaning of high degree, high quality or big quantity, although with
very specific semantic shades. The paper makes use of the corpus data (the
Russian National Corpus as well as the Internet data) to discover individual
properties of interjections and their historical changes. Primary interjections
are described against the background of interjections derived from
words of different parts of speech. It turns out that in the non-canonical use
of primary interjections the K-word can hardly be omitted, whereas derived
interjections can function the same way even without a K-word. Non-canonical
use of derived interjections, with and without K-words, is very
popular in contemporary Russian, especially in slang.
The paper describes an experiment on an instrumental evaluation of the intonation
quality of synthesized Russian speech using the “Inton@Trainer”
computer system. The system was originally designed to train learners
in producing the basic intonation patterns of Russian speech. It is based
on comparing the melodic portraits of a reference sentence and a sentence
pronounced by the learner. Our approach to assessing the intonational
quality of speech makes it possible to hold synthesized speech to the same
strict requirements as are applied to students studying Russian as a second
language. We describe the technology used for the instrumental evaluation
of the intonation quality of synthesized speech and the acoustic database
of reference phrases used to assess the intonation quality of synthesized
speech. The paper presents the results of testing the intonation quality
of two Russian synthetic voices. We discuss the results of the experiment
and outline the ways for improving the methods for objective evaluation
of synthesized speech prosodic quality, as well as the possibility of applying
the developed system in other linguistic tasks.
In this paper we present the RuSentRel corpus including analytical texts
in the sphere of international relations. For each document we annotated
sentiments from the author to mentioned named entities, and sentiments
of relations between mentioned entities. In the current experiments, we considered
the problem of extracting sentiment relations between entities for
whole documents as a three-class machine learning task. We experimented
with conventional machine-learning methods (Naive Bayes, SVM,
Random Forest).
The paper explores the distribution and interpretation of the discourse
marker po(-)xodu (PX) and addresses a possible path of its diachronic
development. We argue that the range of uses of PX attested in the corpora
supports an analysis that identifies three meanings / functions of this
item labeled eventive PX, epistemic PX and discourse-level PX throughout
this paper. We propose that the latter two are the products of re-interpretation
of the former. We argue for a presuppositional analysis of the eventive
PX whereby it requires there be a set of background events that show
a temporal overlap with the asserted event and add up to the integral whole.
We analyze the epistemic PX as resulting from inferential reinterpretation
of the relationship between background and asserted events, with the abductive
reasoning being the key ingredient of this reinterpretation. Finally,
we treat the discourse-level PX as a counterpart of the eventive PX in the domain
of speech acts. We speculate that Krifka’s (2014) recent view of speech
acts as index changers opens a way of accounting for this parallelism
in a principled way. On the diachronic side, we identify PX as the product
of diachronic development of the construction in which the argument of the
noun xod ‘move’ is expressed by an overt DP. In the course of development, this DP was first replaced by pro, which gave rise to the eventive PX, and
later on developed epistemic and discourse-level meanings / functions.
M
A new yet powerful tool for drug repurposing and hypothesis
generation has recently emerged. Text mining of different domains like scientific libraries
or social media has proven to be reliable in that application. One particular
task in that area is medical concept normalization, i.e. mapping a disease
mention to a concept in a controlled vocabulary, like Unified Medical Language
System (UMLS). This task is challenging due to the differences in language
of health care professionals and social media users. To bridge this
gap, we developed end-to-end architectures based on bidirectional Long
Short-Term Memory and Gated Recurrent Units. In addition, we combined
an attention mechanism with our model. We have done an exploratory study
on hyperparameters of proposed architectures and compared them with the
effective baseline for classification based on convolutional neural networks.
A qualitative examination of the mentions in a dataset of user reviews collected
from popular online health information platforms and a quantitative one
both show improvements in the semantic representation of health-related
expressions in user reviews about drugs.
Being a matter of cognition, user interests should be apt to classification
independent of the language of users, social network and the essence of interest
itself. To prove it, we built a collection of English and Russian Twitter
and Vkontakte community pages manually classified according to the
interests of their followers. First, we created a model of Major Interests
(MaIs) with the help of expert analysis and then classified the mentioned set
of pages using machine learning algorithms (SVM, Neural Network, Naive
Bayes, Logistic Regression, Decision Trees, k-Nearest Neighbors) trying
different optimization techniques. We take three interest domains that are
typical of both English and Russian-speaking communities: football, rock
music, vegetarianism. The results of classification show a greater correlation
between Russian-Twitter and English-Twitter pages. The Logistic Regression
with Bernoulli bag-of-words model proves to be the most effective
classification algorithm.
N
In this paper, we describe the coreference annotation on a multi-lingual parallel
treebank (PAWS), a portion of Wall Street Journal translated into Czech,
Russian and Polish which continues the tradition of multilingual treebanks
with coreference annotation. The paper focuses on language-specific differences.
We analyse syntactic structures concerning anaphoric relations
in the languages under analysis, such as personal and impersonal constructions
in polypredicative constructions and pro-drop qualities.
The paper presents a contrastive analysis of pronominal adverbs in German
(dabei, darauf, damit etc.) and their equivalents in English, Czech and Russian.
The analysis is based on an empirical study of parallel news texts. Our
main focus is to show the interplay between cohesive devices expressed
through German pronominal adverbs in text and explore their equivalents
in English, Czech and Russian. As the dataset at hand contains translations,
we also focus on the influence of the translation factor in parallel texts.
P
The paper deals with suspended assertion (Russian: снятая утвердительность).
It is shown that the term suspended assertion, introduced
by U. Weinreich in 1963, covers the same range of phenomena as the term
nonveridicality (the proposed Russian translation is неверидикативность),
which has become widespread in the literature on formal semantics
thanks to the work of A. Giannakidou, F. Zwarts and others. Facts of Russian
that call for the notion of suspended assertion are considered: pronouns
of the какой-нибудь type, negative polarity pronouns, the disappearance
of a semantic actant of verbs in the direct (non-parametric)
diathesis, the mirror symmetry of past and future, negation
with extended scope, pronouns in -нибудь in the scope
of negation, and the interchangeability of еще ‘still’ and уже ‘already’.
The conviction is expressed that the notion of suspended assertion will find
application in other contexts as well.
The paper describes the results of the first shared task on word sense induction
(WSI) for the Russian language. While similar shared tasks were conducted in the
past for some Romance and Germanic languages, we explore the performance
of sense induction and disambiguation methods for a Slavic language that shares
many features with other Slavic languages, such as rich morphology and virtually
free word order. The participants were asked to group contexts of a given word
in accordance with its senses that were not provided beforehand. For instance,
given a word “bank” and a set of contexts for this word, e.g. “bank is a financial
institution that accepts deposits” and “river bank is a slope beside a body of water”,
a participant was asked to cluster such contexts into a number of clusters not known
in advance, corresponding in this case to the “company” and the “area”
senses of the word “bank”. For the purpose of this evaluation campaign, we developed
three new evaluation datasets based on sense inventories that have
different sense granularity. The contexts in these datasets were sampled from
texts of Wikipedia, the academic corpus of Russian, and an explanatory dictionary
of Russian. Overall, 18 teams participated in the competition submitting 383
models. Multiple teams managed to substantially outperform competitive state-of-the-art
baselines from the previous years based on sense embeddings.
The paper examines the illocutionary use of conjunctions, in which
a conjunction links the proposition of one clause to the illocutionary
modality of another. A scalar approach to interpreting this phenomenon
is argued for: alongside indisputably illocutionary and indisputably
non-illocutionary uses, there is a class of constructions
with intermediate properties. Criteria for distinguishing degrees
of illocutionarity are formulated. It is demonstrated, in particular, that
imperative sentences, unlike interrogative ones, are never
indisputably illocutionary. Evidence is presented that
the proposed approach is supported by the grammar: different
conjunctions are compatible with different kinds of illocutionary use,
and the correlate тогда ‘then’ is not used in indisputably illocutionary
constructions.
The current paper deals with the integration of the Japanese language
in a multilingual NLP model, namely, the Compreno model. The formalism
includes morphological, syntactic and semantic patterns, covering all possible
semantic and syntactic dependencies a word can attach. The architecture
of the model allows us to acquire nearly all semantic links of a word
through its proper positioning in a thesaurus-like semantic hierarchy, where
words are linked through semantic dependencies. The inheritance principle
of the hierarchy simplifies the syntactic description of a newly added language
as well. Unlike the traditional approach to Japanese parsing based
on chunks, or bunsetsus, we suggest a Japanese parser based on constituents.
Special attention is given to the tools that allow us to automate
the language description process and significantly speed it up.
The work on the Japanese model is still in progress; we therefore show
the current results we have achieved, and point out problems that remain
to be solved.
This paper studies the impact corpus size has on the robustness of various
frequency-based measures of corpus distance (or similarity, respectively),
such as Euclidean distance, Manhattan distance, Cosine distance,
χ², Spearman’s ρ, and Simple-Maths Keyword distance. An experiment
performed using the British National Corpus shows that Euclidean distance
is least influenced by corpus size and thus is best suited for the purpose
of comparing corpora.
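To make the compared measures concrete, three of them (Euclidean, Manhattan and cosine distance) can be computed over word-frequency vectors as in the sketch below. It assumes, purely for illustration, that each corpus is represented by relative word frequencies over the joint vocabulary; the abstract does not say which frequency normalization was used, and the function names are hypothetical.

```python
import math
from collections import Counter

def relative_frequencies(tokens):
    """Map each word to its share of all tokens in the corpus."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def distances(tokens_a, tokens_b):
    """Frequency-based distances between two tokenized corpora."""
    fa, fb = relative_frequencies(tokens_a), relative_frequencies(tokens_b)
    vocab = sorted(set(fa) | set(fb))          # joint vocabulary
    a = [fa.get(w, 0.0) for w in vocab]
    b = [fb.get(w, 0.0) for w in vocab]
    euclidean = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    manhattan = sum(abs(x - y) for x, y in zip(a, b))
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return {
        "euclidean": euclidean,
        "manhattan": manhattan,
        "cosine": 1.0 - dot / norm,            # cosine distance
    }

corpus_a = "the cat sat on the mat".split()
corpus_b = "the dog sat on the log".split()
result = distances(corpus_a, corpus_b)
```

Studying robustness to corpus size would then amount to recomputing these values on samples of increasing size and checking how much each measure drifts.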
The paper focuses on Russian constructions with clauses (or VPs) combined
by means of the discourse marker A, which behaves as a conjunction
or as a particle in different contexts. Prosodically, the construction may
come up in two forms: (a) as a single illocution with the first clause pronounced
with a rising pitch that projects discourse continuation, and (b)
as two separate illocutions with the first clause pronounced with a falling
pitch that projects no continuation. Using data from the Prosodically
Annotated Corpus of Spoken Russian, the prosody and grammar of (a)
and (b) were analyzed qualitatively and quantitatively. Type (b) appeared
to be as frequent as type (a) and systematically favored in pragmatically
marked contexts.
R
This paper describes a practical solution for the task of referring expressions
generation (REG) in the context of a question-answering system.
When an answer to a question is found in the knowledge base, the system has
to decide how to present the answer to the user, that is, which properties uniquely
distinguish the object found from other objects in the knowledge base.
Another task where referring expressions would be useful is the semantic
graph visualization task. Building on top of the graph-based approach
presented by Krahmer et al. in 2003, this paper provides some practical improvements
to the algorithm, namely: 1) instead of depth-first graph search
we use breadth-first search, which is dramatically faster when the scene
graph is big but the description graph to be found is small; 2) we limit the
size (the number of edges) of the resulting description graph to increase
performance and avoid uselessly long descriptions. A sketch of the linguistic
realization of the referring expressions is also outlined.
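The two ideas named above, breadth-first search and a cap on description size, can be illustrated with a deliberately simplified sketch. It is not the graph-based algorithm of the paper: objects are modelled as flat sets of properties rather than scene graphs, and the names `distinguishing_description`, `objects` and `max_size` are hypothetical. Because candidates are explored in order of increasing size, the first description that fits only the target is also a smallest one.

```python
from collections import deque

def distinguishing_description(target, objects, max_size=3):
    """Breadth-first search for a smallest property set unique to `target`.

    `objects` maps object names to sets of properties; `max_size` caps the
    description length, mirroring the size limit described in the text.
    """
    props = sorted(objects[target])
    queue = deque([()])                        # start from the empty description
    while queue:
        desc = queue.popleft()
        if desc:
            matches = [name for name, p in objects.items() if set(desc) <= p]
            if matches == [target]:
                return set(desc)
        if len(desc) < max_size:
            # Extend only with later properties so each subset is visited once.
            start = props.index(desc[-1]) + 1 if desc else 0
            for prop in props[start:]:
                queue.append(desc + (prop,))
    return None                                # nothing unique within the limit

objects = {
    "d1": {"dog", "small", "brown"},
    "d2": {"dog", "large", "brown"},
    "c1": {"cat", "small", "brown"},
}
best = distinguishing_description("d1", objects)
too_short = distinguishing_description("d1", objects, max_size=1)
```

In this toy scene no single property singles out `d1`, so the search returns the two-property description; with `max_size=1` it gives up and returns `None`.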
S
The structure of everyday dialogue was studied using 73 micro-dialogues
of everyday spoken communication from the corpus
of spoken Russian “One Day of Speech” (the ORD corpus). The aim
of the study was to find out which types of speech acts most often
initiate and conclude a dialogue, and to identify the most typical
sequences of speech acts in dialogue structure. The speech
of 30 people (6 informants and 24 interlocutors) was analyzed,
amounting to 2230 speech acts from both professional
and everyday conversations. N-gram analysis was used to count
the most frequent sequences of speech acts.
The results showed that dialogues are most often initiated
by representatives, i.e. speech acts involving the exchange
of information (38% of cases); an “etiquette” opening (greetings,
vocatives) occurs in 23% of dialogues, and in 19% of cases the conversation
begins with a regulative form. The speech acts that conclude a dialogue
are more diverse: representatives (16% of cases),
evaluative judgments (valuatives) (14%), regulative forms (14%),
directives, commissives and etiquette forms (8% each), and expressives
(7%). The most typical binary sequences of speech
acts were: two representatives in a row (22.35%), a regulative
form followed by a representative (6.93%), a representative followed
by a regulative form (6.0%), a valuative followed by a representative
(5.21%), a representative followed by an evaluative judgment (4.77%), and
the combination of a directive with a representative in either order (2.77% each).
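The n-gram counting behind such figures can be sketched in a few lines. The example below counts adjacent pairs (bigrams) of speech-act labels and reports each pair's share of all pairs; the label abbreviations and the toy sequence are invented for illustration and are not taken from the ORD data.

```python
from collections import Counter

def bigram_shares(sequence):
    """Return (pair, share) tuples for adjacent labels, most frequent first."""
    pairs = list(zip(sequence, sequence[1:]))
    counts = Counter(pairs)
    total = len(pairs)
    return [(pair, count / total) for pair, count in counts.most_common()]

# Hypothetical labels: REP = representative, REG = regulative, VAL = valuative
acts = ["REP", "REP", "REG", "REP", "REP", "VAL", "REP"]
shares = bigram_shares(acts)
```

For a corpus of separate dialogues, the pairs would be collected within each dialogue before the counts are merged, so that the last act of one dialogue is not paired with the first act of the next.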
Probabilistic topic modeling is a powerful tool of text analysis, which reveals
topics as distributions over words and then softly assigns documents to the
topics. Even though the aggregated distributions can be good with basic
models, a sequential topic representation of each document is often unsatisfactory.
This work introduces a method that increases the quality
of the topical representation of each single text using its segmental structure.
Our approach is based on Additive Regularization of Topic Models (ARTM),
which is a technique for imposing additional criteria on the model. The proposed
method efficiently avoids the bag-of-words assumption by considering
the topical connections of words that co-occur in a local segment. We assume
that sequential sentences are topically and semantically coherent,
while the number of topics in each particular text fragment is low. We apply
our model to the topic segmentation task and achieve better quality than
the current state-of-the-art TopicTiling algorithm. In further experiments
we demonstrate that the proposed technique reveals an interpretable sequential
structure of documents while keeping the number of topics low, i.e.
the sparsity of the model increases. Apart from topic segmentation, the
constructed topical text embeddings can be used in any other application
where analysis of the document structure is desirable.
In this paper we introduce RusDraCor—an open corpus of Russian drama
for digital literary & linguistic research. The corpus (rus.dracor.org) contains
plays from the middle of the 18th to the first third of the 20th century, provided
with structural (plus some semantic) markup and metadata. Texts are encoded
in the XML-based standard TEI, widely used in building corpora for
the humanities. We describe the contents and annotation layers of our corpus,
provide some details on its development and enrichment, and finally
describe three research cases. Each case demonstrates the use of RusDraCor
to answer specific questions about composition, structural features
and historical evolution of Russian drama.
This work continues the study of speech disfluencies, which has already become
a traditional topic at the Dialogue conferences (see, in particular,
Подлесская, Комарова 2010; Лауринавичюте, Федорова
2010; Подлесская 2013; Богданова-Бегларян 2013; Подлесская 2014;
Потанина и др. 2016). In this paper the issue is examined
by comparing the linguistic behavior of Russian-speaking children
aged 10–12 (Section 1) with that of adult native speakers, using
the tangram corpus (Section 2). Section 3 presents a classification
of speech disfluencies, and Section 4 gives the results of the study.
Finally, Section 5 discusses the results
and prospects for further work. We show that the discourse
behavior of children aged 10–12 differs, with respect to speech disfluencies,
from that of adult native speakers, which supports
our hypothesis of late discourse development in children.
Every adult native speaker of Russian knows that kon’ is masculine and
lan’ is feminine, although 3rd declension nouns present some difficulties
in first and second language acquisition. However, will the fact that
these nouns are less frequent than masculine nouns ending in a consonant
or feminine nouns ending in -a/ja play a role for online subject-predicate
agreement processing? Or will subject-predicate agreement processing
be more problematic with subjects of a certain gender? Finally, some final
consonants are more characteristic of feminine gender, while others are more
characteristic of masculine gender. Are speakers sensitive to this? We present two experiments
addressing these questions. We found that all three factors play
a role, but for different tasks (online agreement processing or determining
the gender of a novel word) and at different processing stages.
We offer a new neural architecture for character-level morphological tagging,
combining character-level networks with the output of a neural language
model on morphological tags. Our proposal reduces tagging error
by up to 10% in comparison with the baseline model and achieves state-of-the-art
performance on both the ru_syntagrus and MorphoRuEval datasets.
The paper deals with differential object marking in the Russian speech
of Nanai-Russian bilingual speakers, namely the variation such as принес
рыбу ~ принес рыба (‘{he} brought fish-acc ~ fish-nom’). The puzzle is that
this peculiarity can result from a number of different processes: morphosyntactic
borrowing from Nanai, penetration of dialectal features into the
speech of bilinguals, under-acquisition or reinterpretation of the Standard
Russian system. The data of a small corpus of contact-influenced Russian
speech are used to test all these hypotheses. The results are as follows. Nominative
forms are used in DO-position in quite a systematic way and such uses
cannot be regarded as occasional “errors”. The main factors that influence
the NOM~ACC distribution are a) information structure and b) the accentual
type of the noun stem. The latter fact supports the hypothesis of a systematic
reinterpretation of the Standard Russian system in the situation of incomplete
acquisition. No significant correlations with animacy, definiteness, verb form
or word order were attested. The DOM pattern of Nanai Russian differs from
those of Russian dialects and reveals some similarity to that of Nanai. However,
it cannot be considered a full morphosyntactic calque.
T
The paper attempts a corpus analysis of the semantics
of Russian personal and possessive pronouns in intensional
contexts (taking contexts of counterfactual identity as an example).
The aim of the study was to determine whether
pronouns of various types can be interpreted de se or de re in such
contexts and which of the interpretations is preferred.
Contexts of counterfactual identity are syntactic
positions in the scope of a modifier or
clause that introduces an unreal condition concerning the identity
of individuals that are not identical in reality (cf. на твоём
месте ‘in your place’, English if I were you). In such contexts a pronoun may
denote the real person (as in Я бы на их месте таких должников,
как я, в хвост и гриву гоняла ‘In their place, I would hound debtors like me
mercilessly’; de re) or the unreal one (Я бы на их месте
поставил парочку шалашиков в любом приглянувшемся мне месте ‘In their place,
I would put up a couple of huts in any spot that took my fancy’;
de se, i.e. the one from whose point of view the unreal situation is viewed).
Using the General Internet Corpus of Russian (GICR, about 20 billion tokens),
we show that the pronouns я ‘I’ and мой ‘my’ allow both the de re
and the de se interpretation, with the former preferred; that the reflexive
pronoun себя ‘oneself’ also allows both interpretations, with de se
preferred; and that the reflexive possessive pronoun
свой ‘one’s own’ is interpreted de se without exception. In addition,
some qualitative observations are made concerning the identification
of an atomic individual with a plural one, as in я бы на вашем
месте не стала морочить себе голову, вы молодые люди у вас ещё всё
впереди ‘in your place I would not trouble my head over it, you are young
people, you have everything ahead of you’.
The purpose of the paper is to investigate cues signalling the relations between
discourse units in Russian. Building a lexicon of discourse connectives
is an indispensable subtask in many discourse parsing applications
as well as an essential issue in theoretical research on text coherence.
In order to develop such a resource for Russian, we have conducted a corpus-based
study of discourse connectives that were manually extracted
from the Russian Rhetorical Structure Treebank (Ru-RSTreebank). The Treebank
includes 79 texts annotated within the RST framework [Mann, Thompson
1988]. In order to provide a deeper analysis of connectives in Russian,
we focus on causal relations only, namely, the ‘Cause-Effect’ relation. Some
of the connectives (primary connectives) are enumerated in grammars and
dictionaries. They primarily mark the intra-sentential relations. However,
there is an expansive class of less grammaticalized items (secondary connectives)
that have received less attention till now. Some of them are based
on content words (e.g. по причине ‘for the cause’). Secondary connectives
often serve as linking devices for inter-sentential relations.
We suggest a scheme for annotating connectives in Russian. We specify
the basic patterns that can be used for mining less grammaticalized connectives
in an unannotated corpus. In addition, we provide a comparison
of two classes of connectives (primary vs. secondary ones). Our research
has shown that these two classes differ in their properties. There is a statistically
significant difference between them with respect to the nucleus/
satellite position, intra- vs. inter-sentential relations and some others.
U
The subject of this paper is the so-called adverbial prepositions of Russian; cf. vokrug (kostra) ‘around smth.’, daleko ot (doma) ‘far from smth.’, etc. By definition, an adverbial preposition either coincides with an adverb (cf. vokrug) or contains an adverb and a preposition (cf. daleko ot). As I have demonstrated in my previous works, an adverbial preposition and the underlying adverb have the same meaning, the only difference between them being the mode of expression of the main semantic actant; cf. Gorel koster, vokrug (preposition) kostra stojali liudi ‘A fire was burning, people were standing around it’ vs. Gorel koster, vokrug (adverb) stojali liudi ‘A fire was burning, people were standing around’. From the modern point of view, a syntactic distinction is insufficient for interpreting such cases as different words (or different meanings of a word). So, an adverbial preposition and the underlying adverb should be interpreted as the same meaning of a given word. I argue that this word is an adverb (or a prepositional adverb). This paper deals with the syntax of these adverbs. Such adverbs have one or more semantic actants, at least one of them being expressed by a noun or a prepositional group. The problem is that in some cases it is not clear whether the prepositional group is governed by the adverb or by the verb governing this adverb (in which case the adverb and the prepositional group are co-governed by the verb). A criterion for distinguishing government by the adverb from government by the verb is discussed. Two Russian adverbs, zadolgo ‘for a long time before smth.’ and nezadolgo ‘shortly before smth.’, are described from this point of view.
V
The paper examines correlative tautological constructions
of the type что будет, то (и) будет ‘what will be, will be’, in which the subordinate
clause precedes the main clause and the content of the two parts is materially
identical. Analysis of material from the Russian National Corpus
and Internet sources reveals a number of non-trivial
properties of these constructions. Thus, some tautologies
convey opposite meanings in different contexts: что
было, то было ‘what was, was’ can be interpreted both as ‘it cannot be denied
that this really happened’ [Булыгина, Шмелев 1997] and as a readiness
to forget the past for the sake of the future [Активный словарь
русского языка]. Furthermore, the particle и in the main clause is acceptable
in some tautologies but unacceptable in others. The paper proposes
an explanation of these facts by distinguishing four possible
meanings on the basis of two oppositions: (a) whether the situation
described is in the speaker’s focus of attention or is removed
from it; (b) whether the reading of the construction is generic or
specific-referential.
Y
One of the means of designating the coherence in the spoken discourse
is demonstrating that the current utterance of the discourse is not terminal.
Every step of narrative consisting of the chain of statements can be marked
as non-final. The prosodic cues for incompleteness applied to the speech
act of a statement have been studied in detail in the linguistic literature. In this
paper, discourse incompleteness is analyzed as combining not only
with statements but with questions, imperatives, and vocatives as well. The
results of the investigation are as follows. Wh-questions, imperatives,
and vocatives can be freely combined with the meaning of discourse continuity,
and they have specific prosodic cues for marking this combination
of meanings, whereas yes-no questions do not accept prosodic incompleteness
marking. The prosodic patterns of incompleteness and the
accent placement in questions, vocatives, and imperatives are exemplified
here by the dialogues taken from the Multimodal corpus of the Russian National
corpus, the Prosodically Annotated Corpus of Spoken Russian (spokencorpora.ru),
and the minor working collection of the Russian speech
recordings specifically set up for this investigation. The software program
Praat was used to analyze the audio data.
Z
The paper offers a semantic analysis of the Russian indefinite
adverb как-нибудь ‘somehow’, based on data from the French,
Italian and English parallel subcorpora of the Russian National Corpus (RNC)
and on a database of Russian discourse words and their French
equivalents. The study applies the unidirectional method
of contrastive analysis, in which the way a professional
translator renders the meaning of the analyzed unit
of the source text is treated as its quasi-definition,
revealing possible implicit components of its meaning.
The study confirmed the high degree of language specificity
of this word (which manifests itself, on the one hand,
in a considerable share of zero equivalents, among both the “models”
and the “stimuli” of translation, and also in a wide range
of different “models” and “stimuli” of translation). The word
как-нибудь was also found to have the meaning of a “marker of uncontrollability”,
functionally similar in a number of contexts to the subjunctive in the Romance languages,
which is not recorded in explanatory or bilingual dictionaries; on the other
hand, it was found that the purely evaluative meaning ‘carelessly,
badly’ has considerably narrowed its sphere of use in the modern language
compared with the 19th century and is realized mainly
simultaneously with the basic meaning of indefiniteness of manner.
This paper addresses the problem of parametric variation in Russian
grammar, with a focus on copular constructions with agreeing and non-agreeing
adjectival predicates. Based on the Russian National Corpus, I reconstruct
two dialects of Russian morphosyntax. They differ regarding the
assignment of the predicative instrumental case, raising conditions and
the distribution of agreeing vs non-agreeing predicates after быть 'be',
стать 'become' and казаться 'seem'. Russian-A only licenses predicative
instrumental on adjectives after SEEM (казалось странным, что P)
and non-agreeing predicatives after non-zero forms of BE or BECOME
(было странно, что P). Russian-B allows non-agreeing forms after SEEM
(казалось странно, что P) and forms of the predicative instrumental case
after non-zero forms of BE and BECOME (было странным, что P). I argue
that the differences between Russian-A and Russian-B must be explained
in terms of parametric settings and claim that Russian predicatives lack
forms of the predicative instrumental. The assignment of the predicative
instrumental to adjectival heads can be explained as subject control in all
dialects, but only Russian-B allows raising of sentential arguments to the
position of the matrix subject.
The paper describes an architecture under development for modeling
natural communicative behavior on the F-2 robot. An important part of our
work is a corpus study of human communicative behavior and the subsequent
transfer of such behavior to the robot. Drawing on the multimodal REC
corpus, we describe features of natural communication and develop an
architecture that takes these features into account. In this architecture
the robot can express a given communicative function in different ways,
using one or several actuators: for example, it can demonstrate an appeal
by means of facial expressions, head movements, or hand gestures. The
architecture also allows gestures with different communicative functions to
be combined flexibly. Using the split, join, and single modes, it can
combine tags from different BML packets and synchronize tags within
a single BML packet. These features are key to producing plausible behavior
of the F-2 robot and are necessary for making communication between the
robot and the user more effective.