Modern language models encode extensive information about the compatibility and meanings of words. One way to represent such lexical information, explored in the present study, is the construction of semantic sketches.
This paper presents a solution to the task of predicting a predicate from its most frequent actants and circonstants using the BERT neural network, which achieved the best quality metrics in the Dialogue Evaluation SemSketches competition. The study analyzes several solutions to this task and ways to improve them, drawing on the peculiarities of the architecture and the linguistic nature of the data.
Testing of the selected methods showed that the most successful tool for determining the semantic sketch of a predicate is the Conversational RuBERT model combined with a search for synonyms of the target verbs in the training data.
Other promising ways to improve the quality of mapping a predicate to its semantic sketch include the use of contextualized embeddings, which make it possible to take context into account, as well as fine-tuning of the models used.
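As an illustration of the masked-prediction idea behind such solutions, the sketch below poses a predicate's frequent arguments as a cloze and lets a masked language model rank candidate predicates. The checkpoint name and the template are assumptions for the example, not the paper's exact setup.

```python
# A minimal sketch, assuming the public DeepPavlov Conversational RuBERT
# checkpoint on HuggingFace; the paper's actual pipeline is not reproduced here.
from transformers import pipeline

fill = pipeline("fill-mask", model="DeepPavlov/rubert-base-cased-conversational")

# Frame the sketch's frequent actants/circonstants as a cloze over the predicate.
template = "Врач [MASK] пациента в больнице."
for candidate in fill(template, top_k=5):
    print(candidate["token_str"], round(candidate["score"], 3))
```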
In this paper, we describe a way to perform span normalization as a sequence labelling task. Our model predicts the modifications that should be applied to the span tokens to normalize them. This prediction is performed via sequence labelling, which means that each token is normalized independently. Despite the simplicity of the approach, we show that it can lead to state-of-the-art results. We compare different pre-training schemas in application to this task. We show that the best quality can be achieved when the normalizer is trained on top of a BERT-based morphosyntactic parser’s representations. Moreover, we propose some additional features useful in the task and prove that auxiliary morphosyntactic losses can help the model. Furthermore, we show that the model compares favourably with other contestant models of the RuNormAS competition.
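A toy illustration of normalization as per-token sequence labelling may help: each token receives an edit label that is applied independently. The label inventory and lemma lookup below are invented for the example; the paper's actual label set differs.

```python
# A minimal sketch with a made-up three-label scheme (KEEP / LEMMA / DELETE).
def apply_labels(tokens, labels, lemma_of):
    out = []
    for tok, lab in zip(tokens, labels):
        if lab == "KEEP":
            out.append(tok)
        elif lab == "LEMMA":      # replace the token with its dictionary form
            out.append(lemma_of.get(tok, tok))
        # "DELETE" drops the token entirely
    return " ".join(out)

lemma_of = {"Московским": "Московский", "университетом": "университет"}
print(apply_labels(["Московским", "университетом"], ["LEMMA", "LEMMA"], lemma_of))
# -> Московский университет
```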
In this paper, we describe our solution to the Lexical Semantic Change Detection (LSCD) problem. It is based on a Word-in-Context (WiC) model detecting whether two occurrences of a particular word carry the same meaning. We propose and compare several WiC architectures and training schemes, as well as different ways to convert WiC predictions into final word scores estimating the degree of semantic change.
We participated in the RuShiftEval LSCD competition for the Russian language, where our model achieved the second-best result. During post-evaluation experiments we improved the WiC model and managed to outperform the best system. An important part of this paper is a detailed error analysis in which we study the discrepancies between WiC predictions and human annotations and their effect on the LSCD results.
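One simple way to convert pairwise WiC predictions into a word-level change score, assumed here purely for illustration, is to average the predicted probability of sameness over sampled cross-period usage pairs and invert it; the paper compares several such aggregation schemes.

```python
# A minimal sketch of one aggregation scheme (an assumption, not necessarily the
# paper's final choice): mean "different meaning" probability over cross-period pairs.
import numpy as np

def change_score(wic_same_prob: np.ndarray) -> float:
    """wic_same_prob[i] = P(same sense) for the i-th cross-period usage pair."""
    return float(1.0 - wic_same_prob.mean())

print(change_score(np.array([0.9, 0.2, 0.4, 0.1])))  # higher = more change
```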
In this paper we propose a new Word Sense Induction (WSI) method and apply it to construct a solution for the RuShiftEval shared task on Lexical Semantic Change Detection (LSCD) for the Russian language. Our WSI algorithm, based on lexical substitution, achieves state-of-the-art performance for the Russian language on the RUSSE'2018 dataset. However, our LSCD system based on it has shown poor performance in the shared task. We have studied the mathematical properties of the COMPARE score employed in the task for measuring the degree of semantic change, as well as the discrepancies between this score and our WSI predictions. We have found that our method can detect those aspects of semantic change to which the COMPARE metric is not sensitive, such as the appearance or disappearance of a rare word sense. An important property of our method is its interpretability, which we exploit to perform a detailed error analysis.
The paper presents a novel method for near-duplicate detection in handwritten document collections of school essays. The large number of online resources offering ready-made academic essays currently makes it possible to cheat by reusing them during high school final exams. Despite the importance of the problem, at the moment there is no automatic method of near-duplicate detection for handwritten documents such as school essays. A school essay is represented as a sequence of scanned images of handwritten essay text. Despite advances in the recognition of handwritten and printed text, applying these methods to the current task remains a challenge. The proposed method of near-duplicate detection does not require detailed text markup, which makes it possible to use it in a large number of tasks related to information extraction in a zero-shot regime, i.e. without any specific resources written in the processed language. The paper presents a method based on series analysis. The image is segmented into words. The text is characterized by a sequence of features that are invariant to the author’s writing style: the normalized lengths of the segmented words. These features can be used for both handwritten and machine-readable texts. The computational experiment is conducted on the IAM dataset of English handwritten texts and a dataset of real images of handwritten school essays.
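The style-invariant feature described above can be sketched in a few lines: word lengths (e.g., segmented word widths in pixels) normalized by their mean, so that absolute handwriting size cancels out. The comparison by Pearson correlation below is an assumption for illustration; the paper's actual series-analysis procedure may differ.

```python
# A minimal sketch: normalized word-length profiles of two handwritten pages.
import numpy as np

def length_profile(word_widths_px):
    w = np.asarray(word_widths_px, dtype=float)
    return w / w.mean()              # normalization removes absolute scale

a = length_profile([40, 90, 55, 120, 60])     # author 1
b = length_profile([52, 117, 71, 156, 78])    # author 2, larger handwriting
print(np.corrcoef(a, b)[0, 1])                # close to 1.0 for near-duplicates
```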
The paper suggests one way to formally define the degree of idiomaticity of a given text. Text idiomaticity is understood as the density of idiom use per text unit. In the proposed approach, the degree of idiomaticity is assessed as the ratio of the total number of idioms to the volume of the text in which they occur. The corpus experiment we conducted allows us to conclude that the degree of idiomaticity varies significantly across the most important prose writers of the second half of the 19th century. Thus, the degree of idiomaticity of a text turns out to be an essential feature of individual style.
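The measure reduces to a one-line computation; the choice of 1,000 tokens as the text unit below is an assumption for illustration, and the counts are invented.

```python
# Idiomaticity as defined above: idioms per text unit (here, per 1,000 tokens).
def idiomaticity(num_idiom_occurrences: int, num_tokens: int, unit: int = 1000) -> float:
    return num_idiom_occurrences / num_tokens * unit

print(idiomaticity(57, 120_000))  # ~0.48 idiom occurrences per 1,000 tokens
```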
The paper presents the results of a corpus study of the order of direct and indirect objects in ditransitive constructions in Russian (like Petya dal Mashe yabloko ‘Petya gave Masha an apple’ or Petya dal yabloko Mashe ‘Petya gave an apple to Masha’). This topic has been widely discussed in the literature, but previous hypotheses have been based on individual examples and have never been tested on corpus data. Based on earlier research, we selected parameters that affect the order of the objects, such as length, depth, animacy, and the role of individual verbs, and statistically tested their real effect on two subsamples: one with a dative indirect object and one with a prepositional object.
The article summarizes the results of the long-term project «Языки Русских Городов» (Languages of Russian Cities, YARG) on collecting and studying regional vocabulary, a project which, unfortunately, was never “finalized” in the form of academic publications for a number of reasons. A substantial body of regional material (about 4,000 units) was collected and systematized; on its basis, a typology of regional differences is considered and the notion of a regional norm is introduced and discussed. Special attention is paid to issues of reliability and to methods of computational regional corpus research, including automatic text classification and author profiling. Together with this publication, a “reincarnation” of the YARG project returns to the pool of open lexicographic resources, now on the basis of a unified portal for differential sociolinguistic research that includes the internet corpus ГИКРЯ (GICR) and the interactive dictionary ЯГеЛь (Languages of Cities and People).
This study raises the problem of the difference between normal and forced (deep) speech breathing. The aim of
this work was to study the intonational-pausal segmentation of speech in normal and forced breathing after physical
activity. The results of the study show that in the process of reading, the structure of the text determines the organization of breathing, and the breathing rate and respiration depth have an impact on the intonational-pausal segmentation
of speech, as well as on the duration and number of intonational pauses.
This paper aims to show the results of a quantitative study on verbal aspect in modern Russian. Adopting a
corpus-based approach, we investigate the phenomenon known as ‘aspectual competition’, which can take place when
the imperfective aspect (ipf) is used instead of perfective to designate a single and complete event in the past. In
particular, we investigate the interaction between the choice of aspect and co-textual factors in overlapping situations.
In this study attention is focused on one aspectual pair, namely pokupat’ (ipf) – kupit’ (pf) ‘to buy’. The work consists
of two parts: in Phase 1 data were collected from the spoken subcorpus of the Russian National Corpus and the webcorpus RuTenTen11, annotated for several morpho-syntactic factors, and then examined. In Phase 2 a questionnaire
was submitted to native speakers in order to collect more empirical evidence on aspect choice and verify the results
obtained from the corpus study. In both phases, statistical methods were used to analyse the data. Results show that
the aspect of the target verb mainly interacts with two factors: the presence of a contiguous verb in the linguistic context and the presence of an object modifier.
The article summarizes the results of a large research project dedicated to the investigation of pragmatic markers (PMs) in Russian everyday speech. Pragmatic markers are essential in spontaneous spoken discourse; thus, quantitative data on their usage are necessary for solving both theoretical and practical issues related to the study of spoken communication. New results were obtained on the data of two speech corpora: “One Day of Speech” (ORD; mostly dialogues; the annotated subcorpus contains 321,504 tokens) and “Balanced Annotated Text Library” (SAT; monologues; the annotated subcorpus includes 50,128 tokens). Statistical data were calculated for PMs in dialogic and monologic speech; pragmatic markers common to both types of speech (e.g., hesitative markers like vot, tam, tak) were identified, as well as PMs most typical of monologues (e.g., boundary markers like znachit, nu, vot, vs’o) or dialogue (e.g., ‘xeno’ markers such as takoi, grit and metacommunicative markers like vidish’, (ja) ne znaju). Special attention is given to the usage of pragmatic markers in different communicative situations.
Research in semantics is actively conducted in both theoretical and computational linguistics, but the formulations of tasks, objectives and results of semantic research in the two communities usually differ considerably. As a step towards reducing this gap and increasing the awareness of theoretical linguists about what computational linguists are doing, we examine meaning representation approaches in computational linguistics and contrast them with how this is done within one of the best-known theoretical approaches, the Meaning ⇔ Text Theory.
The paper presents a detailed account of the semantics of the Russian perfective verb подождать (≈ ‘wait some time’), which belongs to the family of words centered around the verb ждать ‘wait’. The verb, much like the whole family, has a set of unique and non-trivial semantic properties that have so far not been adequately represented either in traditional and computer dictionaries of the Russian language or in scientific descriptions. The main features of this verb include its peculiar morphological and semantic relationship with the dominant word of the family, the verb ждать, as well as a ramified valence frame characterized by rarely occurring means of implementing semantic valencies and unusual co-occurrence conditions.
The paper describes a way to generate a dataset of Russian word forms, which is needed to build an appropriate neural model for morpheme segmentation of word forms. The developed generation procedure produces word forms segmented into morphs that are classified by morpheme type, based on an existing dataset of segmented lemmas and additional dictionary data, as well as on a fine-grained classification of Russian inflectional paradigms, which makes it possible to correctly process word forms with alternating consonants and fleeting vowels in endings. The resulting representative dataset (more than 1.6 million word forms) was used to develop a neural model for morpheme segmentation of word forms with classification of the segmented morphs. The experiments have shown that in detecting morph boundaries the model is comparable in quality with the best segmentation models for lemmas (98% F-measure), slightly outperforming them in word-level classification accuracy (91%).
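The target representation for such a model can be pictured as character-level labels combining a position tag with a morph type. The BMES-style scheme and the tag names below are assumptions for illustration; the paper's actual tag set is richer.

```python
# A toy labelling for morpheme segmentation with morph classification.
def char_labels(morphs):
    """morphs: list of (surface, type) pairs, e.g. [("при", "PREF"), ...]."""
    labels = []
    for surface, mtype in morphs:
        if len(surface) == 1:
            labels.append(f"S-{mtype}")   # single-character morph
        else:
            labels += [f"B-{mtype}"] + [f"M-{mtype}"] * (len(surface) - 2) + [f"E-{mtype}"]
    return labels

print(char_labels([("при", "PREF"), ("город", "ROOT"), ("н", "SUFF"), ("ый", "END")]))
```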
The paper provides the results of a study of the use of the genitive case with partitive semantics as a means of direct object marking with imperfective verbs in Russian. The genitive partitive is traditionally claimed to be compatible with perfective verbs and, as an exception, with imperfective verbs used as substitutes for perfective verbs in neutralization contexts. The analysis of data from the Russian National Corpus and the Russian-language Internet shows that the use of the genitive partitive with imperfective verbs is neither rare nor marginal. The degree of compatibility of the genitive with imperfective aspectual correlates of prefixed perfective verbs depends on the degree of imperfectivability and on frequency. The use of the genitive partitive is sensitive to the semantics of the imperfective, but it covers a broader range of phenomena than is traditionally assumed. Although the use of the genitive partitive is mostly restricted to neutralization contexts such as iterativity and the historical present, a number of gradual achievement imperfective verbs with progressive semantics, as well as verbs that refer to constant situations, are compatible with the genitive partitive.
We introduce the first study of detoxification of Russian texts to combat offensive language in social media. While much work has been done in this field for the English language, the task has not previously been addressed for Russian. We test two types of models, an unsupervised approach based on the BERT architecture that performs local corrections and a supervised approach based on the pre-trained GPT-2 language model, and compare them with several baselines. In addition, we describe the evaluation setup, providing training datasets and metrics for automatic evaluation. The results show that the tested approaches can be successfully used for detoxification, although there is room for improvement.
Recently there has been growing interest in the topic of Russian text adaptation, both in the theoretical aspects of intralingual translation into Simple and Plain Russian and in practical tasks like automatic text simplification. It is therefore important to study the characteristics that make an adapted text more accessible. In this paper, we aim to investigate the strategies that human experts employ when simplifying texts, particularly when the texts are being adapted for learners of Russian as a foreign language. The main data source for this research is the RuAdapt parallel corpus, which consists of Russian literature texts adapted for learners of RaaFL and the original versions of these texts. We study the changes that occur during the adaptation process at the lexical, morphological, and syntactic levels, and compare them to the methods usually described in methodological recommendations for teaching RaaFL.
The paper presents a methodology for fine-tuning the RuGPT3-XL (Generative Pretrained Transformer-3 for Russian) language model for the task of text span normalization. The solution was submitted to a competition with two tracks: normalization of named entities (Named entities) and normalization of a wider class of text spans, including different parts of speech (Generic spans).
The best solution achieved 0.9645 accuracy on the Generic spans task and 0.9575 on the Named entities task.
This paper contributes to the research field of bimodal linguistics, which explores the two modalities involved in everyday communication, vocal and kinetic. When exploring almost any scientific phenomenon, one addresses two opposite issues: individual differences, on the one hand, and general patterns, on the other. We have focused on individual differences and proposed a “portrait” approach to communication. We face the difficult task of finding a good metric for analyzing the oculomotor behavior of people in everyday communication. In previous papers, starting from [14], the authors were looking for oculomotor patterns, but their results depend critically on the metric used. In this paper, we compared the most common metrics and showed that individual differences carry much more weight than general patterns. We then identified four coefficients that determine these individual differences: k_aside, k_vip, k_chain, and dur_75. By comparing these Core Oculomotor Portraits, we were able to make these individual differences clearer. However, a fact is a fact: there are far more individual differences than general patterns in our Narrators’ behavior. The proposed coefficients, in our opinion, clearly show (and even explain and predict) the observed individual differences.
Text simplification is the task of reducing the complexity of the vocabulary and sentence structure of a text while retaining its original meaning, with the goal of improving readability and understanding. We explore the capability of autoregressive models such as RuGPT3 (Generative Pre-trained Transformer 3 for Russian) to generate high-quality simplified sentences. Within the RuSimpleSentEval shared task we present our solution based on different usages of RuGPT3 models. The following setups are described: 1) few-shot unsupervised generation with the RuGPT3 models; 2) the effect of the training dataset size on the downstream performance of the fine-tuned model; 3) three inference strategies; 4) downstream transfer and a post-processing procedure using pre-trained paraphrasers for Russian. This paper presents the second-place solution on the public leaderboard and the fifth-place solution on the private leaderboard. The proposed method is comparable with novel state-of-the-art approaches. Additionally, we analyze the performance and discuss the flaws of RuGPT3 generation.
In the last year, new neural architectures and multilingual pre-trained models have been released for Russian, which has raised the problem of evaluating their performance across a range of language understanding tasks.
This paper presents Russian SuperGLUE 1.1, an updated benchmark styled after GLUE for Russian NLP models. The new version includes a number of technical, user-experience and methodological improvements, including fixes of benchmark vulnerabilities unresolved in the previous version: novel and improved tests for understanding the meaning of a word in context (RUSSE), along with reading comprehension and common sense reasoning (DaNetQA, RuCoS, MuSeRC). Together with the release of the updated datasets, we improve the benchmark toolkit based on the jiant framework for consistent training and evaluation of NLP models of various architectures, which now supports the most recent models for Russian. Finally, we provide the integration of Russian SuperGLUE with a framework for industrial evaluation of open-source models, MOROCCO (MOdel ResOurCe COmparison), in which models are evaluated according to the weighted average metric over all tasks, inference speed, and the occupied amount of RAM.
Argumentation mining is a field of computational linguistics devoted to extracting arguments and the relations between them from texts, classifying them, and constructing argumentative structures. A significant obstacle to research in this area for the Russian language is the lack of annotated Russian-language text corpora. This article explores the possibility of improving the quality of argumentation mining by extending the Russian-language version of the Argumentative Microtext Corpus (ArgMicro) with a machine translation of the Persuasive Essays Corpus (PersEssays). To make it possible to use these two corpora in combination, we propose a Joint Argument Annotation Scheme based on the schemes used in ArgMicro and PersEssays. We solve the problem of classifying argumentative discourse units (ADUs) into two classes, “pro” (“for”) and “opp” (“against”), using traditional machine learning techniques (SVM, Bagging and XGBoost) and a deep neural network (a BERT model). We propose an ensemble of XGBoost and BERT models, which showed the highest ADU classification performance on both corpora.
Automatic text simplification is a crucial task that reduces text complexity while preserving meaning. This paper presents our solution to the Russian Sentence Simplification Shared Task (RSSE), based on a back-translation technique. We show that applying this simple back-translation approach to sentence simplification can give results competitive with other methods, without fine-tuning or training.
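A minimal version of the back-translation idea is a round trip through a pivot language. The Helsinki-NLP Opus-MT checkpoints below are our assumption for illustration; the abstract does not specify which translation systems were used.

```python
# A minimal back-translation sketch using publicly available Opus-MT models.
from transformers import pipeline

ru_en = pipeline("translation", model="Helsinki-NLP/opus-mt-ru-en")
en_ru = pipeline("translation", model="Helsinki-NLP/opus-mt-en-ru")

def simplify(sentence: str) -> str:
    # A round trip through English tends to shorten and regularize the sentence.
    english = ru_en(sentence)[0]["translation_text"]
    return en_ru(english)[0]["translation_text"]

print(simplify("Данное предложение является чрезвычайно сложным для восприятия."))
```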
In this study, we test a transfer learning approach on Russian sentiment benchmark datasets using an additional training sample created with a distant supervision technique. We compare several variants of combining the additional data with the benchmark training samples. The best results were achieved using a three-step approach of sequential training on general, thematic and original training samples. For most datasets, the results improved on the current state-of-the-art methods by more than 3%. The BERT-NLI model, which treats the sentiment classification problem as a natural language inference task, reached the human level of sentiment analysis on one of the datasets.
The article investigates the semantics of English phrasal verbs (PhVs), which are viewed as lexico-grammatical constructions. Triangulation of introspective, cognitive and corpus methods of analysis allows us to identify the semantic dimensions that characterize the semantic pattern of the PhV construction. The construction reveals attraction effects, recruiting new verbs provided the action or motion event is identical. Depending on the strength of attraction between the verb and the particle, a new verb may be accepted to fill the corresponding slot of the construction, which gives rise to a new phrasal verb. This allows us to categorise PhVs according to their attraction level and to identify their PhV patterns in corpus data.
This paper presents the results of the Russian News Clustering and Headline Selection shared task. As a part of
it, we propose the tasks of Russian news event detection, headline selection, and headline generation. These tasks
are accompanied by datasets and baselines. The presented datasets for event detection and headline selection are
the first public Russian datasets for their tasks. The headline generation dataset is based on clustering and provides
multiple reference headlines for every cluster, unlike the previous datasets. Finally, the approaches proposed by the
shared task participants are reported and analyzed.
Leaderboards like SuperGLUE are seen as important incentives for the active development of NLP, since they provide standard benchmarks for fair comparison of modern language models. They have driven the world’s best engineering teams, as well as their resources, to collaborate on solving a set of tasks for general language understanding. Their performance scores are often claimed to be close to or even higher than human performance. These results have encouraged more thorough analysis of whether benchmark datasets feature statistical cues that machine-learning-based language models can exploit. For English datasets, it was shown that they often contain annotation artifacts. This allows certain tasks to be solved with very simple rules while achieving competitive rankings.
In this paper, a similar analysis was done for Russian SuperGLUE (RSG), a recently published benchmark set and leaderboard for Russian natural language understanding. We show that its test datasets are vulnerable to shallow heuristics. Often approaches based on simple rules outperform or come close to the results of well-known pre-trained language models like GPT-3 or BERT. It is likely (as the simplest explanation) that a significant part of the SOTA models’ performance on the RSG leaderboard is due to exploiting these shallow heuristics rather than to real language understanding. We provide a set of recommendations on how to improve these datasets, making the RSG leaderboard even more representative of the real progress in Russian NLU.
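To make the notion of a shallow heuristic concrete, here is the simplest kind of rule-based "model" such an analysis checks against: one that ignores the input entirely and always answers the training set's majority class. This toy baseline is our illustration, not one of the paper's specific heuristics.

```python
# A majority-class baseline for a binary yes/no task; it can rank surprisingly
# high when a dataset is imbalanced or contains annotation artifacts.
from collections import Counter

def majority_baseline(train_labels, test_size):
    majority = Counter(train_labels).most_common(1)[0][0]
    return [majority] * test_size

print(majority_baseline(["yes", "yes", "no", "yes"], test_size=3))  # ['yes', 'yes', 'yes']
```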
This paper discusses the experience of developing a web resource intended to study argumentation in popular
science discourse. This type of argumentation is, on the one hand, the main means of achieving a communicative goal
and, on the other hand, often not expressed in explicit form. The web resource is built around a corpus of 2256 articles,
distributed over 13 subcorpora. The annotation model, which is based on the ontology of argumentation and D. Walton's argumentation schemes for presumptive reasoning, underlies the argument annotation of the corpus. The distinctive features of the argument annotation model are the introduction of weighting characteristics into text markup
through assessing the persuasiveness of the argumentation, as well as highlighting argumentative indicators visually.
The paper considers a scenario of argument annotation of texts, which allows constructing an argumentative graph
based on the typical reasoning schemes. The scenario includes a number of procedures that enable the annotator to
check the quality of the text markup and assess the persuasiveness of the argumentation. The authors have annotated
162 texts using the developed web resource and, as a result, identified the most frequent schemes of argumentation (Example Inference, Cause to Effect Inference, Expert Opinion Inference) and described some specific indicators of these frequent schemes. Based on the above-mentioned outcomes, the authors listed the indicators of the most frequent argumentation schemes and made some recommendations for annotators on identifying the main thesis.
The research is focused on definitions of discourse relations, a topic that is currently little-studied. The paper
gives a brief overview of existing solutions for discourse relations definitions: Rhetorical Structure Theory (RST),
Segmented Discourse Representation Theory (SDRT), Penn Discourse Treebank (PDTB), and Cognitive approach to
Coherence Relations. The author presents the criteria these approaches use to define a discourse relation or, in the case of a narrower definition, a logical-semantic relation, and outlines the shortcomings of the described definitions. The author
also describes the principles used to build the classification and the definitions of logical-semantic relations (LSR) in
the Supracorpora Database of connectives (SDB). The classification is based on four basic semantic operations upon
which rests every LSR's definition: implication, location on the chronological scale, comparison, correlation between
specific and general or an element and a set. The classification consistently distinguishes the levels at which the LSR
can be established: propositional, illocutionary, and metalinguistic. Each LSR is defined on the basis of these two
criteria. Thus, for example, for the LSR of alternative based on the comparison operation, one has the choice between
the LSR of propositional, illocutionary and metalinguistic alternative (We will go to the mountains or to the sea vs.
Put the gun away, or are you scared? vs. The symbol of the year or, simply speaking, cutie-pie). In the case of LSRs based on implication or comparison, a polarity criterion is added, distinguishing whether the LSR is established between p and q alone or whether their negative correlates ¬p and ¬q must also be taken into account to obtain a correct interpretation (cf. well-known descriptions of how the Russian conjunction no ‘but’ functions). In addition, semantic and
pragmatic characteristics of the context are also considered in the classification. For example, in the case of the LSR
of specification and generalization, the semantic correlation between p and q (together with their intensional and
extensional interpretations) is taken into account. Several definitions of LSRs and corresponding examples are provided.
Thus, the LSR of extensional specification is defined as follows: based on the operation of correlation between the
general and the particular; established at the propositional level; X contains a generalized notion or state of things p;
Y contains a more particular notion q, limiting p extensionally. And the LSR of intensional specification is defined as follows: based on the operation of correlation between the general and the particular; established at the metalinguistic level; X contains a generalized concept or state of things p; Y contains a more particular notion q, limiting p intensionally. The definitions adopted in the SDB make it possible to evaluate, on the basis of the proposed criteria, the semantic closeness of relations and to increase the level of consistency in the work of experts and annotators. That in turn increases the value of the annotated material, and therefore its reliability.
The paper is focused on divergent ways of conveying discourse relations in translation. For data collection, we
used the supracorpora database of connectives storing parallel texts from the Russian-French subcorpus of the Russian
National Corpus. These data show which logical-semantic relations tend to be translated by divergent means, i.e. means other than connectives (exclusion in its various gradations, propositional concomitance and substitution, with the share of divergent translations ranging from 30% to 50%). Such data also help identify what causes divergent means of translation to be used. The causes may be as follows: (a) the lack of an adequate equivalent of a given connective in the
target language; (b) differences in the syntactic structure of the source and target languages; (c) usage differences; (d)
contextually determined use of divergent translation. If there is a prototypical indicator of logical-semantic relations
(i.e. connective) in the source text, it also occurs in translation in more than 90% of cases. The data on human translations are then compared with those on machine translations, which shows that the machine translation system also
tends to keep a connective if there is one in the source text (it occurs in almost 98% of cases). However, there are
cases where the machine translation system has difficulties processing a multiword connective (failing to perceive it as a whole) or a polyfunctional unit (failing to tell a connective from a non-connective) and thus uses divergent means to translate it. Some causes of divergently translating connectives are likely to be the same for human and machine translations: differences in the syntactic structure of the languages and usage differences. Further research on divergent means of conveying discourse relations will make it possible to draw a sharper borderline between explicitly expressed and implicit discourse relations. The data collected from annotated corpora (both monolingual and parallel multilingual) will help determine what the divergent ways of expressing logical-semantic relations are and how frequently they are used. The research results can be used both in automatic text processing and in automatic text generation. The data on divergent translations of discourse relations can also serve to improve machine translation quality.
The categories of concreteness and specificity are important for understanding the mechanisms of information representation and processing in the human brain. These two categories are quite close, but still different. A method for quantifying the degree of correlation between these categories has recently been proposed for English. This paper deals with similar research on Russian. Ratings from the Concreteness/Abstractness Dictionary (RDCA) are taken as the measure of word concreteness. The degree of a word's specificity is estimated by its location in the RuThes thesaurus. The paper presents a comparison with the English data and shows the similarity of the results for Russian and English.
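The correlation measurement itself is straightforward to sketch: rank-correlate per-word concreteness ratings with a specificity proxy such as depth in the thesaurus hierarchy. The numbers below are invented stand-ins for RDCA ratings and RuThes positions.

```python
# A minimal sketch of correlating concreteness with specificity.
from scipy.stats import spearmanr

concreteness = [4.8, 4.5, 2.1, 1.3, 3.9]   # dictionary ratings (toy values)
thesaurus_depth = [7, 6, 3, 2, 5]          # deeper node = more specific concept
rho, p = spearmanr(concreteness, thesaurus_depth)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
```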
Nowadays, BERT models are widely used in the NLP field. However, standard BERT training can be stifled by the lack of labels for different tasks when a multitask setting is treated as a one-task multilabel setting: for every example, we have labels from that example’s source task but not from the other tasks. This article addresses this issue, exploring eight different data pseudo-labeling approaches in the GLUE 4-task setting. These approaches do not require changes to the samples or the model architecture. One of the presented techniques improves on the RTE results of the original article by 6.2%, and falls behind the original article on QQP, MNLI, and SST by only 0.5–1.2%. This technique also outperforms the other pseudo-labeling approaches explored in the article by 0.5–2% on average when similar tasks are considered. However, for tasks that are dissimilar to each other, a different proposed approach yields the best results.
The task of semantic role labeling usually focuses on identifying and classifying the core, obligatory arguments of the predicate. The adjuncts of Time, Location, etc. (non-core, modifier arguments) are considered peripheral to the task [30] and even its easy part [44], despite the fact that they are highly integrated into the clause structure and may interact non-trivially with the meaning of the verb [4, 32]. In this paper, we present experiments on labeling the adjunct roles of LOCATION, TIME, MANNER, DEGREE, REASON, and PURPOSE, based on the manually annotated AdjunctsFrameBank dataset. The results show an average F1 score of 0.94 on the gold adjunct phrase annotations using word2vec representations of adjuncts, word2vec representations of predicates, and the morphosyntactic marking of adjuncts. Our findings generally corroborate the theoretical hypothesis on the structural and semantic autonomy and lexico-morphosyntactic specialization of adjuncts. Yet a more complicated organization of their network is revealed, pointing to the diversity of adjuncts in terms of their distribution and behavior.
This paper focuses on the development of a new computational system, Prosimetron, which enables comparative statistical studies of the rhythm of verse and prose in different languages (currently 10 languages are operative,
with the possibility of adding more). The results of the analysis can be used not only for studying the processes for
the genesis, expansion, and modification of various versification systems, but also for commenting on and interpreting
the verse rhythm in different national poetic traditions in comparison with their foreign sources and language prosody.
In addition, the possibility of modeling various processes of poetic speech generation and of analyzing the rhythmic vocabularies of prose allows formulating hypotheses about the cognitive mechanisms of verse generation. This system operates in a semi-automatic mode and, by minimizing errors and enabling the processing of large amounts of data, provides a unique
tool for computer research on the rhythm of different modes of speech.
This paper provides results of participation in the Russian News Clustering task within Dialogue Evaluation
2021. News clustering is a common task in the industry, and its purpose is to group news by events. We propose
two BERT-based methods for news clustering, one of which shows competitive results in the Dialogue 2021 evaluation. The first method uses supervised representation learning. The second one reduces the problem to binary
classification.
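The reduction from clustering to binary classification can be sketched as follows: a pair classifier estimates the probability that two documents report the same event, and clustering is then run on the induced distance matrix. The pairwise probabilities below are placeholders for such a classifier's outputs.

```python
# A minimal sketch: pairwise "same event" probabilities -> distance -> clusters.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# same_event[i, j] = P(documents i and j report the same event)
same_event = np.array([[1.0, 0.9, 0.1],
                       [0.9, 1.0, 0.2],
                       [0.1, 0.2, 1.0]])
distance = 1.0 - same_event

labels = AgglomerativeClustering(
    n_clusters=None, distance_threshold=0.5,
    metric="precomputed", linkage="average",   # metric= requires sklearn >= 1.2
).fit_predict(distance)
print(labels)   # e.g. [0 0 1]: the first two documents form one event cluster
```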
This paper describes the results of the first shared task on speech processing for low-resource languages of Russia. Speech processing tasks are notoriously data-consuming. The aim of the shared task was to evaluate the performance of state-of-the-art models on low-resource language data, as well as to draw the attention of experts to field linguistics data (using LingvoDoc project data). The tasks included language identification and IPA transcription, with three teams participating. The paper also provides a description of the datasets as well as an analysis of the participants’ solutions. The datasets created as a result of the shared task can be used in other tasks to enhance speech processing and help develop modern NLP tools for both speech communities and field linguists.
The present paper analyzes the intonation of pragmatic particles da "yes" and net "no" found in the spontaneous
dialogue speech corpus of a Northern Russian dialect, in which each word bears a pitch accent. Intonation that marks
such particles sounds unusual for speakers of Standard Russian and is perceived by them as blunt and impolite. The
main aim was to find a consistent pattern explaining the distribution of falling and rising pitch accents on such particles in a dialect of Vaduga (Arkhangelsk region). We tested three hypotheses that can account for this distribution: (a)
semantic explanation (the type of pitch accent depends on the semantics of the very particle); (b) communicative
explanation (it depends on the communicative function of the preceding utterance, that is, whether it is a question or
not); (c) phonetic explanation (it depends on the pitch accent of the preceding utterance). A total of 240 utterances
from 3 speakers were analyzed. Results showed that the semantics of the particle is not a relevant factor, while the
communicative type and the pitch accent of the preceding utterance are significant predictors of the pitch accent that
marks the particle, with the latter better explaining the data. We propose that when analyzing the intonation of a dialect,
semantic interpretation of the intonational constructions of the standard dialect should not be taken into account.
Moreover, we suggest that a new approach to collecting prosodic data from elderly people while controlling for pragmatic context is needed.
The paper discusses the notion of parentheticals in Russian spoken discourse. Using data from two prosodically
annotated corpora — “Stories about presents and skiing” and “Russian Pear Chats & Stories” — I advocate for a
discourse-oriented approach to parenthetical constructions. I define a parenthetical construction as consisting of three
elements: the left context, the parenthetical unit, and the right context. Each element constitutes a separate discourse
unit and is thus prosodically autonomous. I rely on the notion of projection [Auer 2005] to account for the discourse
relationships between these three components. When the speaker pronounces the left context, she projects a continuation that is to be realized in the right context, while the parenthetical unit provides a digressive discourse step.
Typically (around 50% in my data), parentheticals are anchored to their left contexts and are pronounced with a
falling or level pitch accent. Noted deviations from this prototype include free parentheticals, parenthetical uses of
vot, and parentheticals pronounced with a rising pitch accent. Furthermore, I explore two prosodic features frequently
associated with parentheticals, namely, increased articulation rate and pitch range narrowing. I show that, while both
these tendencies are statistically significant, the latter has a larger effect size than the former.
This paper describes FineMotion’s gesture-generating system entry for the GENEA Challenge 2020. We start with simple baselines and expand them by using context and by combining audio and textual features. Among the participating systems, our entry attained the highest median score in the human-likeness evaluation and the second-highest median score in appropriateness.
Currently, there are more than a dozen Russian-language corpora for sentiment analysis, differing in the source of the texts, domain, size, number and ratio of sentiment classes, and annotation method. This work examines publicly available Russian-language corpora and presents their qualitative and quantitative characteristics, which give an idea of the current landscape of corpora for sentiment analysis. A ranking of the corpora by annotation quality is proposed, which can be useful when choosing corpora for training and testing. The influence of the training dataset on the performance of sentiment analysis is investigated using the deep neural network model BERT. The experiments with review corpora allow us to conclude that, on average, the quality of models increases with the number of training corpora. For the first time, quality scores were obtained with the BERT model for the corpus of reviews of the ROMIP seminars. The study also proposes the task of building a universal model for sentiment analysis.
The study explores the discourse formulae (DFs) of disagreement in Russian and English belonging to the subclasses of refusal and prohibition. Starting with a subset of six Russian target DFs, we establish their English equivalents using corpus analysis. We also define the typical speech acts to which the DFs in both languages react, and
design model contexts that exemplify these types of speech acts. We use the model contexts as stimuli in our Russian
and English surveys, where we look at the preferences of native speakers in the choice of DFs across the speech acts. We use the survey data to establish the pragmatic function of each DF (i.e. refusal, prohibition, or both) and
their potential in each subclass (strong, medium, or weak). For each DF, we also identify the types of speech acts to
which they react most readily. We compare the results of our analysis to the lexicographic description of the target
DFs as presented in the Russian-English Dictionary of Idioms.
The paper considers constructions «predicative + infinitive». For the first time, a class of interpretive infinitive constructions (opposed to emotional reactions) is introduced. For emotional reactions, the predicative and the infinitive refer to the
same subject, and infinitives of perception, mental and speech verbs are typical of them: It hurts / scares to see how forests are
dying (‘X sees, X is scared’) → It hurts that forests are dying. For interpretive constructions, the subjects of the predicative and
the infinitive do not coincide: It is heartless to separate the mother from the children – ‘X separates, Y evaluates such an act as
heartless’. The infinitives of perceptual and mental verbs in such a construction are either not used, or they denote a kind of
action: It is tactless to listen to private conversations.
l
The article focuses on the role of animacy in Russian and French pronominal systems. Although animacy is a
grammatical category only in Russian, while in French it is not reflected in the behavior of nouns, it turns out that
some animacy-based restrictions on the use of anaphoric and demonstrative pronouns are common for the two languages. We address syntactic restrictions that affect the following types of uses: (i) use of anaphoric pronouns in
copular constructions; (ii) repetition of anaphoric pronouns for the sake of clarity and/or emphasis; (iii) deictic use of anaphoric pronouns; (iv) anaphoric use of demonstrative pronouns. In all four cases, except, perhaps, the fourth one, pronouns tend to have an animate referent, while inanimate ones are more problematic. We conclude that these restrictions mainly result from the fact that animate objects have greater discourse importance and more often become the main subject of the discourse than inanimate ones. At the same time, the degree of strictness of the restrictions sometimes differs between the two languages: for instance, demonstrative pronouns in the anaphoric use tend to have an animate antecedent in Russian, while in French this tendency is weaker.
The modal particle uzh is perhaps the most difficult Russian discourse word to describe since its semantics is
highly elusive. The existing descriptions are rather abstract and poorly correlate with various cases of usage of uzh.
Besides, they do not take into consideration several crucial components of this particle’s meaning. For instance, in
phrases like Uzh ya-to znayu (‘I do know’) one can notice a hugely important component of meaning: the idea of a
scale. One can say Ya-to etot sekret znayu, a vot drugim nevdomek (‘I do know the secret, whereas others have no
idea about it’), and in this example, uzh would be irrelevant. Uzh ya-to eto znayu presupposes that others probably
know it too, but it’s me who knows it for sure. This very idea of a scale and poles together with the idea of the
exceedance of expectations (which is also important for the meaning of uzh) constitutes the semantic contribution that
this particle makes. Moreover, uzh partly smooths the opposition between the central and other elements of a multitude, because it does not exclude the latter from consideration; it merely emphasizes the central one.
The aim of this research is to examine those types of uzh usage where the idea of a scale is most clearly actualized. Presumably, if we understand how the significant components of this particle’s meaning function, we will get closer
to the development of a complete picture of its usage. For example, the idea of a scale within the meaning of uzh is
expressed in the context of a special question (Zachem uzh tak zlo? ‘Why so mean?’). In an argument uzh often
implies that the speaker was almost ready to back down, but not to this extent, as in the famous poem by Daniil Kharms called «Liar» (1930). The idea of a scale is vividly realized in the context of an implicit (Gde uzh mne!, ‘How
can I…’) or explicit negation. It is especially interesting to pay attention to the peculiar effects of the combination of
uzh with comparative forms (luchshe uzh, ‘it would be better...’). The usage of uzh in standard word combinations
raz uzh, esli uzh, togda uzh has its restrictions, also connected with the idea of a scale. The development of a modal
meaning in a temporal word, which brings the transformation of a timeline into a scale of expectations or possibilities,
is quite typical.
In the present paper, we analyze a group of Russian nouns denoting professions and social roles. Historically, these nouns were masculine; in modern Russian, they can also be used with feminine agreement, but only nominative forms are regarded as normative (e.g. etot / eta vrač ‘thisM/F doctor’). Using the Web-as-corpus approach, we showed that oblique-case feminine forms occur naturally, and we conducted three experimental studies. We discovered that the offline rating and online processing of such forms depend on their case. Firstly, this is a unique example of the properties of the form influencing the properties of the lexeme. Secondly, the fact that all oblique forms are regarded as marginal and that the locative was found to be significantly worse than the other oblique cases points to a deep connection between grammatical gender and inflectional classes, and to the crucial role of affix syncretism in morphological processing. This presents a challenge for different approaches in theoretical morphology.
This paper presents the results of a study on the applicability of SOTA methods for morphological corpus annotation (based on GramEval2020) to analytical sociolinguistic research. The study shows that statistically successful morphosyntactic annotation technologies create a number of problems for researchers when used for such purposes as-is, i.e. without any linguistic knowledge. The paper presents methods for improving the reliability of morphological annotation that have been successfully implemented in GICR.
Experiments have been carried out in which human subjects incrementally constructed dependency trees of
English sentences. The subjects were successively presented with growing initial segments of a sentence, and had to
draw syntactic links between the last word of the segment and the previous words. They were also shown a fixed
number of lookahead words following the last word of the segment. The results of the experiments show that a lookahead of one or two words is sufficient for confident incremental parsing of English declarative sentences.
The paper deals with communication failures in everyday spoken discourse. The spontaneous character of oral
speech is its basic property and becomes a prerequisite for the appearance of such a phenomenon as communicative
failures. By communicative failures, we mean speech situations when the recipient of a speech message does not
understand it correctly, i.e., in the way the speaker intended. The purpose of this pilot study is 1) to assess the total
number of communication failures that occur with a person during a single day and 2) to determine the dependence
of communication failure frequency on the communication settings and conditions. The main result of the study is a
qualitative and quantitative assessment of communication failures during a subject’s day. The research is based on a special experiment involving 24-hour monitoring of the subject’s speech and his subsequent retrospective commentary on all recorded data. Such an approach allows one to reduce the subjectivity inherent in much linguistic work. The research continues a series of studies devoted to the effectiveness of spoken communication and is important not only for understanding the fundamental processes of speech perception but also for the development of artificial intelligence systems involving human-computer spoken dialogue and for next-generation speech technologies.
We propose an unsupervised composite scoring function (RuSimScore) to measure the simplification quality of Russian sentences, and a model for text simplification based on this function. The function scores both simplicity and preservation of the original meaning. First, we filtered a noisy parallel corpus (machine-translated WikiLarge) and extracted good simplification examples. After that, a pre-trained language model was fine-tuned on these examples. We generate multiple outputs from the language model and select the best one according to the scoring function. The weights in the scoring function can be adjusted to balance between better content preservation and simpler output sentences (controllable simplification).
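A schematic version of such a controllable score is a weighted combination of the two criteria. The components and weights below are illustrative assumptions, not the actual RuSimScore formula.

```python
# A minimal sketch of a weighted simplification score (not the paper's formula).
def simplification_score(simplicity: float, meaning_preservation: float,
                         w_simple: float = 0.5) -> float:
    """Both inputs are assumed normalized to [0, 1]; e.g. simplicity from an
    inverse readability estimate, preservation from embedding cosine similarity."""
    return w_simple * simplicity + (1.0 - w_simple) * meaning_preservation

# Shifting the weight trades content preservation for simpler output sentences.
print(simplification_score(0.8, 0.6, w_simple=0.3))
print(simplification_score(0.8, 0.6, w_simple=0.7))
```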
We present the first shared task on diachronic word meaning change detection for Russian. The participating
systems were provided with three sub-corpora of the Russian National Corpus — corresponding to pre-Soviet,
Soviet and post-Soviet periods respectively — and a set of approximately one hundred Russian nouns. The task
was to rank those nouns according to the degrees of their meaning change between periods.
Although RuShiftEval is in many respects similar to the previous tasks organized for other languages, we
introduced several novel decisions that allow new methods to be used. First, our manually annotated semantic change dataset is split into more than two time periods. Second, this is the first shared task on word meaning change that provides a training set.
The shared task received submissions from 14 teams. The results of RuShiftEval show that a training set
could be utilized for word meaning shift detection: the four top-performing systems trained or fine-tuned their
methods on the training set. Results also suggest that using linguistic knowledge could improve performance on
this task. Finally, this is the first time that contextualized embedding architectures (XLM-R, BERT and ELMo)
clearly outperform their static counterparts in the semantic change detection task.
Based on data from the multimedia subcorpus of the Russian National Corpus, the paper addresses syntactic,
semantic and prosodic features of a particular type of quotation with the reporting frame headed by the subordinator
kak ‘as’ (kak skazal mne staryj rab pered tavernoj…). Our data show mixed evidence regarding the parenthetical
status of the construction. On the one hand, typically for parentheticals, its function is clearly pragmatized, since it
expresses speaker’s attitude towards the quote. On the other hand, typical parentheticals have only loose syntactic
connection with their “host”, while the kak-phrase is introduced by the subordinator and has the form of the standard
adverbial clause. Furthermore, while typical parentheticals are characterized by grammatical and prosodic reduction,
grammatical and prosodic restrictions operating in the kak-phrase are optional and context (e.g., word order) sensitive.
The data we present support an approach to parenthesis that does not favor either/or decisions but rather rests on multifactorial analysis, considering the whole range of possible parameters and isolating their observed language-specific clusters.
The paper deals with elaborating different approaches to the machine processing of semantic sketches. It
presents the pilot open corpus of semantic sketches. Different aspects of creating the sketches are discussed, as
well as the tasks that the sketches can help to solve. Special attention is paid to the creation of machine processing tools for the corpus. For this purpose, the SemSketches2021 Shared Task was organized. The participants were given the anonymous sketches and a set of contexts containing the necessary predicates. During the Task, participants had to assign the proper contexts to the corresponding sketches.
Recent techniques for the task of short text clustering often rely on word embeddings as a transfer learning
component. This paper shows that sentence vector representations from Transformers in conjunction with different
clustering methods can be successfully applied to address the task. Furthermore, we demonstrate that the algorithm for enhancing clustering via iterative classification can further improve initial clustering performance when the classifiers are based on pre-trained Transformer language models.
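A compact sketch of that pipeline follows: Transformer sentence embeddings, an initial clustering, and one round of enhancement via classification (the iterative variant repeats the last two steps, with confidence filtering omitted here). The embedding model name is an assumption for the example.

```python
# A minimal sketch: embed short texts, cluster, then refine via classification.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

texts = ["cheap flights to rome", "rome airfare deals",
         "pasta carbonara recipe", "how to cook carbonara"]
X = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2").encode(texts)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# Enhancement step: train a classifier on the current labels and relabel
# the texts with its predictions.
labels = LogisticRegression(max_iter=1000).fit(X, labels).predict(X)
print(labels)
```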
Consulting word definitions from a dictionary is a familiar way for a human to find out which senses a particular
word has. We hypothesize that a system that can select a proper definition for a particular word occurrence can
also naturally solve the Semantic Change Detection (SCD) task. To verify our hypothesis, we followed an approach
previously proposed for Word Sense Disambiguation (WSD) and trained a system that embeds word definitions and
word occurrences into the same vector space. In this space, the embedding of the most appropriate definition has
the largest dot product with a contextualized word embedding.
The system is trained on an English WSD corpus. To make it work for the Russian language, we replaced BERT with the multilingual XLM-R language model and exploited its zero-shot cross-lingual transferability. Despite not fine-tuning the encoder model on any Russian data, this system achieved second place in the competition, and it likely works for any of the one hundred other languages XLM-R was pre-trained on, though the performance may vary. We then measure the impact of such WSD pre-training and show that this procedure is crucial for our results. Since our model was trained to choose a proper definition for a word, we propose an algorithm for the interpretation and visualization of semantic changes through time.
By employing additional labeled data in Russian and training a simple regression model that converts the distances between output contextualized embeddings into more human-like scores of sense similarity between word occurrences, we further improve our results and achieve first place in the competition.
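The core scoring rule of this approach fits in a few lines: embed the candidate definitions and the word occurrence in one space and pick the definition with the largest dot product. The vectors below are toy stand-ins for the encoder outputs described above.

```python
# A minimal sketch of definition selection by dot product in a shared space.
import numpy as np

definitions = ["a financial institution", "the land alongside a river"]
def_vecs = np.array([[0.9, 0.1, 0.0],
                     [0.1, 0.8, 0.3]])
occurrence_vec = np.array([0.2, 0.7, 0.4])   # "we sat on the bank of the river"

scores = def_vecs @ occurrence_vec
print(definitions[int(np.argmax(scores))])   # -> "the land alongside a river"
```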
The study focuses on switching from talk to work in an “inclusivity workshop” for people with mental disabilities. Work activities and conversation about general topics can be approached from the perspective of multiactivity
and considered as courses of action intertwined in social interaction. The order of activities is negotiated among participants using both linguistic and non-linguistic means. The data are extracts of video recordings containing a participant getting others to do things. The paper provides a multimodal analysis of 6 cases of an instructor getting an autistic
participant to switch to work, which occurred within a 17-minute conversation about animals. In the data, the autistic
participant never provides a second-pair response to a directive. In 5 out of 6 cases analysed in the paper he fulfils the
action to different extents, demonstrating various degrees of involvement. Getting the autistic person to switch to
work is more effective when suggesting actions one by one, through concrete embodied actions, and when orienting
to phases of the ongoing talk. The study highlights differences between autistic and non-autistic participants switching
from one course of actions to another. Considering goals of an inclusivity workshop, success of switching to work
can be also determined by the opportunities for the smooth conversation.
The paper presents the models detecting the degree of semantic change in Russian nouns developed by the team
aryzhova within the RuShiftEval competition of the Dialogue 2021 conference. We base our algorithms mostly
on unsupervised distributional models and additionally test a model that uses vectors representing morphological
preferences of the words in question. The best results are obtained by the model built on the ELMo architecture
with a small window, while the quality of performance of the “grammatical” model is comparable to that of the
models based on much more sophisticated algorithms.
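One such unsupervised distributional score, the cosine distance between mean contextualized vectors of a word in two periods, can be sketched as follows (illustrative; random vectors stand in here for ELMo outputs):

```python
# Sketch: a simple distributional change score between two time periods.
import numpy as np

def change_score(vecs_period1: np.ndarray, vecs_period2: np.ndarray) -> float:
    # one row per occurrence of the target word in that period
    a, b = vecs_period1.mean(axis=0), vecs_period2.mean(axis=0)
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - cos  # higher value = more semantic change

# In a real system the vectors would come from a contextualized encoder
# such as ELMo; random data is used here only to make the sketch runnable.
print(change_score(np.random.rand(100, 1024), np.random.rand(80, 1024)))
```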
This report presents the results of the RuSimpleSentEval Shared Task conducted as a part of the Dialogue 2021 evaluation campaign. For the RSSE Shared Task, devoted to sentence simplification in Russian, a new middle-scale dataset was created from scratch. It contains more than 3000 sentences sampled from popular Wikipedia pages, each aligned with 2.2 simplified modifications on average. The Shared Task implies sequence-to-sequence approaches: given an input complex sentence, a system should provide its simplified version. A popular sentence simplification measure, SARI, is used to evaluate system performance.
Fourteen teams participated in the Shared Task, submitting almost 350 runs involving different sentence simplification strategies. The Shared Task was conducted in two phases, with the public test phase allowing an unlimited
number of submissions and the brief private test phase accepting one submission only. The post-evaluation phase
remains open even after the end of private testing. The RSSE Shared Task has achieved its objective by providing
a common ground for evaluating state-of-the-art models. We hope that the research community will benefit from
the presented evaluation campaign.
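For reference, SARI can be computed with the easse package; the snippet below is a toy example (the function signature is easse's public corpus_sari API, and the sentences are not from the shared task data):

```python
# Sketch: scoring a simplification system with SARI (assumes `easse`).
from easse.sari import corpus_sari

orig = ["The ministry promulgated the regulation in 2019."]
system = ["The ministry issued the rule in 2019."]
references = [
    ["The ministry published the rule in 2019."],  # one list per reference set
]
print(corpus_sari(orig_sents=orig, sys_sents=system, refs_sents=references))
```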
This paper describes our solution for the RuSimpleSentEval shared task on sentence simplification held together with the Dialogue 2021 conference. Our approach was to filter the provided dataset, fine-tune the pretrained ruGPT3 model on it, and select generated simple candidates based on cosine similarity and ROUGE-L with the complex input sentence. The system achieved SARI 38.49 and took third place in the competition. We have reviewed
and analyzed examples of simplified sentences produced by the model. The analysis showed that the sentences
produced by the system lose the original meaning of the input sentence in about half of the cases.
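The candidate-selection step can be sketched as follows. This is one plausible reading of the procedure rather than the authors' exact scoring rule; the embedding model and the copy threshold are assumptions:

```python
# Sketch: ranking generated simplification candidates by cosine similarity
# of sentence embeddings and ROUGE-L against the complex source sentence.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/LaBSE")

def lcs_len(a, b):
    # dynamic-programming longest common subsequence length
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a)):
        for j in range(len(b)):
            dp[i + 1][j + 1] = dp[i][j] + 1 if a[i] == b[j] \
                else max(dp[i + 1][j], dp[i][j + 1])
    return dp[-1][-1]

def rouge_l(candidate: str, reference: str) -> float:
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    p, rec = lcs / len(c), lcs / len(r)
    return 2 * p * rec / (p + rec)

def select(candidates, source, copy_threshold=0.9):
    vecs = encoder.encode(candidates + [source])
    src = vecs[-1]
    cos = vecs[:-1] @ src / (np.linalg.norm(vecs[:-1], axis=1) * np.linalg.norm(src))
    # prefer candidates close in meaning, but penalize near-copies of the source
    scores = [c if rouge_l(cand, source) < copy_threshold else -1.0
              for c, cand in zip(cos, candidates)]
    return candidates[int(np.argmax(scores))]
```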
This paper deals with the Russian particle zhe and its use in Russian translations from English and demonstrates the possibilities of “one-focus analysis” in contrastive studies based on parallel corpora. It correlates the explications of zhe given in earlier studies (with special reference to the Active Dictionary of Russian) with the stimuli to translation, that is, fragments of the original English text that might cause the appearance of zhe in a Russian translation. The study sought to validate, disprove or improve the semantic analysis of zhe made without recourse to electronic corpora.
The analysis of the stimuli that have led Russian translators to use the particle zhe reveals important characteristics of this word. It turns out that the Russian particle zhe is often pragmatically obligatory, as its absence would violate the idiomatic nature of the utterance and change its illocutionary force. It is often the case that if a translator had given a word-for-word translation, that is, one without the particle, they would have conveyed the precise meaning, but the translation would be inadequate: the wrong implicature would appear. On the other hand, when they add the particle, they may impart new shades of meaning which the original text did not contain.
This article studies the characteristics of implicit and explicit types of aggression in the comments of a Russian social network by means of machine learning. As it is hypothesized that the expression of aggression depends on local norms, the dataset contains comments collected from a single social media community. These comments were divided into three classes: polite communication, implicit aggression, and explicit aggression. Trying different combinations of data preprocessing, we discovered that lemmatization and replacing emojis with placeholders contribute to better results. We tested several models (Naive Bayes, Logistic Regression, Linear Classifiers with SGD Training, Random Forest, XGBoost, RuBERT) and compared their results. The study describes the misclassifications and compares the keywords of each class of comments. The results can be helpful for enhancing algorithms for the detection of implicit aggression.
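The two preprocessing steps reported as helpful can be sketched as follows (assuming the pymorphy2 morphological analyzer; the emoji pattern is a simplification that covers only the main Unicode emoji blocks):

```python
# Sketch: replace emojis with a placeholder, then lemmatize the tokens.
import re
import pymorphy2

EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
morph = pymorphy2.MorphAnalyzer()

def preprocess(comment: str) -> str:
    text = EMOJI.sub(" EMOJI ", comment)
    tokens = re.findall(r"\w+", text)
    # keep the placeholder intact, lemmatize everything else
    lemmas = [t if t == "EMOJI" else morph.parse(t)[0].normal_form
              for t in tokens]
    return " ".join(lemmas)

print(preprocess("Ну ты даёшь 😂"))  # -> "ну ты давать EMOJI"
```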
The paper presents the results of a study that is part of a large-scale project aimed at studying the changes that took place in the Russian language during the first three decades of the 20th century. In the history of Russia, this period was marked by stormy events that led to a radical change in the state system and the formation of a new society. To quantify the scale of the changes that occurred in the language as a result of these dramatic events, it is necessary to analyze a representative volume of linguistic data and to compare different chronological periods using quantitative methods. The research was carried out on the data of an annotated sample from the Corpus of Russian Short Stories of 1900-1930, which contains texts by 300 Russian writers. All the texts in the Corpus are divided into three time frames: 1) the pre-war period (1900-1913), 2) the war and revolutionary years (1914-1922) and 3) the early Soviet period (1923-1930). The frequency distribution of content vocabulary over time was analyzed, which made it possible to identify the main tendencies in the changing frequencies of individual words and lexical groups from one historical period to another and to correlate them with the previously identified dynamics of literary themes. The technique used makes it possible to trace the influence of large-scale political changes on the vocabulary of literary language, to note the peculiarities and tendencies of the writers' worldview in a certain historical period, and also to significantly supplement the analysis of the dynamics of literary themes in fiction.
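The underlying frequency comparison can be sketched as follows (illustrative: relative frequencies per million tokens are computed per period and compared for a target word; the tiny token lists stand in for the actual corpus):

```python
# Sketch: comparing relative word frequencies (ipm) across time periods.
from collections import Counter

def ipm(tokens: list) -> Counter:
    counts = Counter(tokens)
    total = len(tokens)
    return Counter({w: c * 1_000_000 / total for w, c in counts.items()})

periods = {
    "1900-1913": "барин усадьба вечер".split(),
    "1914-1922": "война революция фронт".split(),
    "1923-1930": "завод революция быт".split(),
}
freqs = {name: ipm(toks) for name, toks in periods.items()}
word = "революция"
print({name: round(f[word], 1) for name, f in freqs.items()})
```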
This study contributes to a better understanding of reading intercomprehension as manifested in the
intelligibility of East and South Slavic languages to Russian native speakers in contextualized cognate recognition
experiments using Belarusian, Ukrainian, and Bulgarian stimuli. While the results mostly confirm the expected
mutual intelligibility effects, we also register apparent processing difficulties in some of the cases. In search of an
explanation, we examine the correlation of the experimentally obtained intercomprehension scores with various
linguistic factors, which contribute to cognate intelligibility in a context, considering common predictors of
intercomprehension associated with (i) morphology and orthography, (ii) lexis, and (iii) syntax.
The paper explores the discourse marker ja vižu (lit. ‘I see’) and its cross-linguistic counterparts. We argue that
it presents its scope proposition as the product of abduction, a logical inference that derives the optimal explanation
for the observed state of affairs. This view is supported by the set of observations suggesting that restrictions on the
distribution of ja vižu are mostly derivable as restrictions on abuctive reasoning, which involve informativeness, likelihood and parsimony considerations.
In this paper we consider the taxonomy enrichment task based on a recently released dataset, called Diachronic wordnets, created on the basis of the English and Russian wordnets. We study meta-embedding approaches, which combine several source embeddings, for the hypernym prediction of novel words, and show that meta-embedding approaches obtain the best results for this task compared to other methods based on different principles. When combined with features automatically extracted from the Wiktionary online dictionary, the joint approach improves the results further.
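Two simple meta-embedding schemes, concatenation and averaging of normalized source vectors, can be sketched as follows (illustrative; the paper's actual combination methods may differ, and real systems also use SVD or autoencoder combinations):

```python
# Sketch: building meta-embeddings from several source embeddings.
import numpy as np

def concat_meta(vectors: list) -> np.ndarray:
    # L2-normalize each source first so no single space dominates
    return np.concatenate([v / np.linalg.norm(v) for v in vectors])

def average_meta(vectors: list) -> np.ndarray:
    # requires all sources to share a dimensionality (e.g. after projection)
    return np.mean([v / np.linalg.norm(v) for v in vectors], axis=0)

w2v = np.random.rand(300)        # word2vec-style vector
ft = np.random.rand(300)         # fastText-style vector
bert = np.random.rand(768)       # contextual-model vector
meta = concat_meta([w2v, ft, bert])  # 1368-dimensional meta-embedding
```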
Computation of text similarity is one of the most challenging tasks in NLP, as it implies understanding semantics beyond the meaning of individual words (tokens). Due to the lack of labelled data, this task is often accomplished by means of unsupervised methods such as clustering. Within the DE2021 “Russian News Clustering and Headline Selection” shared task, we propose a method of building robust text embeddings based on the Sentence Transformers architecture, pretrained on a large dataset of in-domain data and then fine-tuned on a small dataset of paraphrases leveraging GlobalMultiheadPooling.
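The fine-tuning stage can be sketched with the sentence-transformers v2 training API. The model name, loss, and paraphrase pairs below are assumptions, and GlobalMultiheadPooling from the paper is replaced by the library's default pooling:

```python
# Sketch: fine-tuning a Sentence Transformers model on paraphrase pairs.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("sberbank-ai/sbert_large_nlu_ru")  # assumed model
train = [
    InputExample(texts=["Курс доллара вырос", "Доллар подорожал"], label=1.0),
    InputExample(texts=["Курс доллара вырос", "Прошёл сильный дождь"], label=0.0),
]
loader = DataLoader(train, shuffle=True, batch_size=2)
loss = losses.CosineSimilarityLoss(model)  # pulls paraphrases together
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
```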
In this work, we present a novel approach to one of the computational paralinguistic tasks: automatic detection of deceptive and truthful information in human speech. This task belongs to the study of destructive behaviour and was first presented at the International INTERSPEECH Computational Paralinguistics Challenge ComParE in 2016. The need for a contactless method of deception detection follows from the fact that existing contact-based approaches such as polygraphs and lie detectors have multiple restrictions, which significantly limit their usage. Both for training and testing of the proposed models we used two English-language corpora (Deceptive Speech Database and Real-Life Trial Deception Detection Dataset). We extracted three sets of acoustic features from those audio samples using the openSMILE toolkit. The proposed approach includes preprocessing of the extracted acoustic features using methods for data augmentation and dimensionality reduction of the feature space. This yielded 1680 speech utterances with a 986-dimensional informative feature vector for each utterance. The main part of the proposed approach is a two-level recognition model, where the first level includes three gradient boosting models (CatBoost, XGBoost and LightGBM). The second level consists of a logistic regression-based model that makes the final prediction of truthfulness or deceptiveness, taking into account the predictions from the first level. Using this approach, we achieved a classification F-score of 85.6%. The proposed approach can be used both independently and as a component of multimodal systems for the detection of deceptive and truthful utterances in speech, as well as in systems for the detection of destructive behaviour.
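The two-level model can be sketched with a standard stacking setup (illustrative; random features stand in for the openSMILE vectors, and all hyperparameters are omitted):

```python
# Sketch: three gradient-boosting models whose out-of-fold predictions
# feed a second-level logistic regression.
import numpy as np
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from catboost import CatBoostClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

X = np.random.rand(200, 986)       # 986-dimensional acoustic feature vectors
y = np.random.randint(0, 2, 200)   # 0 = truthful, 1 = deceptive

stack = StackingClassifier(
    estimators=[
        ("catboost", CatBoostClassifier(verbose=0)),
        ("xgboost", XGBClassifier()),
        ("lightgbm", LGBMClassifier()),
    ],
    final_estimator=LogisticRegression(),  # second-level model
    cv=5,
)
stack.fit(X, y)
```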
In this paper, we explore various multilingual and Russian pre-trained transformer-based models for the Dialogue Evaluation 2021 shared task on headline selection. Our experiments show that the combined approach is superior to the individual multilingual and monolingual models. We present an analysis of a number of ways to obtain sentence embeddings and learn a ranking model on top of them. We achieve 87.28% and 86.60% accuracy on the public and private test sets, respectively.
This paper is aimed at establishing the parameters of dialogic communication expressed through Russian prosody. The linguistic and extra-linguistic constituents of dialogue are analyzed. These are: the illocutionary meanings that generate speech acts characteristic of dialogic communication; the discourse links that combine the successive speech acts of one interlocutor if his/her current contribution to the dialogue is not limited to a single speech act; and the prosodic characteristics of genre, typical of a concrete type of communication (a friendly talk, an exam, a press conference, a scientific presentation, or an interrogation). The proposed taxonomy is based on the analysis of a small working corpus of spoken dialogues from the Russian National Corpus (the Multimodal sub-corpus Murko), the annotated database Spokencorpora.ru, the video hosting Youtube.com, films, scientific conferences, and press conferences. The Praat software is used to analyze the sound data. The paper is illustrated with tracings of sound recordings.
The article analyzes the meaning of Russian discursive words vidimo and po-vidimomu (‘apparently’),
and reconstructs the ways of their semantic evolution over the past two centuries. It is shown that the meaning
of an inference made by the speaker on the basis of some data, which is the only one for both words in modern
language, arose in different ways. The semantic evolution of both words includes the replacement of the meaning of visual perception with the meaning of epistemic evaluation and the acquisition of egocentric semantics. The word vidimo initially served as a marker of a true visual impression; the word po-vidimomu, which initially included an interpretative component, acquired the meaning of a potentially false judgment,
which was subsequently lost. The research is based on texts included in the Russian National Corpus
(www.ruscorpora.ru).
The Ancient Greek WordNet is a new resource that is being developed at the Universities of Pavia and Exeter,
based on the Princeton WordNet. The Princeton WordNet provides sentence frames for verb senses, but this type of
information is lacking in most WordNets of other languages. In fact, exporting sentence frames from English to other
languages is not a trivial task, as sentence frames depend on the syntax of individual languages. In addition, the
information provided by the Princeton WordNet is not corpus-based but relies on native speakers’ knowledge. This
type of information is not available for dead languages, which are by definition corpus languages. In this paper, we
show how sentence frames can be extracted from morpho-syntactically parsed corpora by linking an existing dependency lexicon of Homeric verbs (HoDeL) to verbs in the Ancient Greek WordNet. Given its features, HoDeL allows the automatic extraction of all subcategorization frames available for each verb, along with information concerning their frequency as well as semantic information regarding the possible arguments occurring in specific frames. In the paper,
we show our method to automatically link the two resources and compare some of the resulting sentence frames with
the English sentence frames in the Princeton WordNet.
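The extraction step can be sketched for any dependency-parsed corpus in CoNLL-U format (assuming the conllu package; the file name is a placeholder, and HoDeL's actual format and the linking procedure are not reproduced here):

```python
# Sketch: collecting subcategorization frames for verbs from a parsed corpus.
from collections import Counter, defaultdict
import conllu

CORE = {"nsubj", "obj", "iobj", "obl", "ccomp", "xcomp"}
frames = defaultdict(Counter)

with open("corpus.conllu", encoding="utf-8") as f:  # placeholder path
    for sent in conllu.parse(f.read()):
        for tok in sent:
            if tok["upos"] != "VERB":
                continue
            deps = sorted(d["deprel"] for d in sent
                          if d["head"] == tok["id"] and d["deprel"] in CORE)
            frames[tok["lemma"]][tuple(deps)] += 1  # frame = set of core relations

# frames["δίδωμι"] would then count e.g. ("iobj", "nsubj", "obj") patterns
```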
Based on the frequency dictionary of Russian predicatives, I measure the volume of the lexical class of non-agreeing predicatives licensing the productive dative-predicative sentence pattern, where the predicative assigns dative case to its animate subject. The tested vocabulary includes 422 elements. Their frequency rates are derived from the main corpus of the RNC using an approximation: the number of hits in the context “predicative + dative subject in 1Sg” in the window {-1; 1}. I argue that the Russian dative-predicative construction has an invariant meaning of internal state, i.e. a spatiotemporal stative situation with a priority argument. However, most predicatives licensing dative-predicative structures in Russian also express external states, i.e. spatiotemporal stative situations without a priority argument, if used without an overt referential dative subject. This can be proved both for words denoting physical sensations, cf. X-u kholodno ‘X is cold’ vs kholodno ‘It is cold’, and for some words denoting affections, cf. tosklivo ‘dreary’, ‘sad’, X-u tosklivo ‘X feels sad’ vs zdes’ tosklivo ‘It’s dreary here’. The shift from internal state to external state is licensed in Russian. If a lexical item has regular uses in the dative-predicative structure, it generally can express the meaning of external state outside this structure. The reverse is false: if a lexical item has regular uses as an external state, cf. vetreno ‘windy’, pyl’no ‘dusty’, it can only have infrequent side uses with a dative subject. This asymmetry is confirmed by the corpus data. I check an additional list of words with the meaning of external state, measure their frequency rate in the context “predicative + dative subject in 1Sg” in the window {-1; 1} and compare them to standard dative predicatives.
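The windowed count can be sketched as follows (illustrative; a real RNC query is replaced by a scan over a tokenized text, and only one dative 1Sg form is checked):

```python
# Sketch: counting "predicative + dative subject in 1Sg" in a {-1; 1} window.
def window_hits(tokens: list, predicative: str, dat_1sg: str = "мне") -> int:
    hits = 0
    for i, tok in enumerate(tokens):
        if tok.lower() != predicative:
            continue
        window = tokens[max(i - 1, 0):i] + tokens[i + 1:i + 2]  # {-1; 1}
        hits += any(w.lower() == dat_1sg for w in window)
    return hits

tokens = "мне холодно и тоскливо здесь".split()
print(window_hits(tokens, "холодно"))  # -> 1
```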