This work is an experimental study of split-scrambling constructions. Using Likert-scale acceptability judgments, self-paced reading, and a prosodic experiment, we compare the separation of a left-branch element (a determiner or possessor) from the noun head within a DP or PP with the separation of the nominal head from its complement (a dependent infinitive or prepositional phrase). The results show that, for Russian speakers, separating the head is not only possible but is rated higher than separating the left-branch element from the head. This pattern is explained by information-structure requirements: a fronted left-branch element cannot form the sole topic of the clause. The low ratings are consistent with existing experimental studies; however, the reading-time results appear inconsistent with existing views about the cognitive load required to process split sentences.
This paper presents a pilot study intended to enhance the semantic and conceptual description of Ancient Greek
verbs in WordNet with information from two other resources, VerbNet and FrameNet, and to enrich a treebank of
Ancient Greek texts with semantic information extracted from the three resources. We provided semantic annotation
for verbs based on their morphosyntactic behavior, and performed a number of queries to extract occurrences from the Ancient Greek treebank intended to match the different meanings of each verb. A manual check of the
data extracted shows that, in spite of a limited number of mismatches, our queries yielded reliable results. The queries
can be further refined in the future and complemented with a rule-based algorithm to map frame elements to
dependency structure.
This study analyzes the reactions of the Italian Twitter community to an environmental demonstration that occurred in Rome
on January 2nd, 2023. We compiled a corpus of 368,531 tokens consisting of 11,780 tweets, collected during a 7-day period.
We propose a mixed-method approach that combines automated and manual corpus analyses of sentiment, emotions, and
implicit language. Our findings offer insights into how tweets reflected the users’ attitudes toward a variety of subjects and
entities. Although the sentiment of the overall debate was distributed rather evenly, the incident itself seems to have sparked
negative sentiment and emotions among Twitter users. The results of our manual analyses revealed some issues with respect
to the automatic classification of sentiment, as some tweets contained irony, sarcasm, and slurs. Such non-literal
interpretations were missed by the tools at hand, which could not account for complex rhetorical-argumentative strategies.
We study the performance of BERT-like distributional semantic language models on anaphora resolution and related
tasks with the purpose of selecting a model for on-device inference. We have found that lean (narrow and deep)
language models provide the best balance of speed and quality for word-level tasks, and we open-source the RuLUKE-tiny
and RuLUKE-slim models we have trained. Both are significantly (over 27%) faster than models of comparable
accuracy. We hypothesise that model depth may play a critical role in performance since, according to recent
findings, each layer behaves as a gradient-descent step in an autoregressive setting.
Diachronicon: a new resource for the study of Russian constructions in a microdiachronic perspective
The article describes the linguistic design of the "Diachronicon" database: the features of the diachronic markup of Russian constructions, as well as tags specially designed for searching the diachronic database. The use of a dedicated comment field is also motivated. In addition, the computer interface of "Diachronicon" is presented and described.
The resource provides extensive opportunities for the systematic study not only of specific constructions, but also of general mechanisms of idiomatization and grammaticalization. The database allows the researcher to compare several separate lines of development simultaneously, to search a list of constructions and their characteristics in diachrony, and to track the history of syntactic and semantic changes and the evolving combinatorial restrictions of different constructions.
The study highlights the asynchronous nature of modern group chats and the problems this creates, such as retrieving information relevant to a given question and understanding reply-to relationships. In this work, we formalize the reply recovery task as a building block toward solving these problems. Using simple heuristics, we then apply the resulting reply-recovery model to the thread reconstruction problem. We show that modern pre-trained models such as BERT achieve strong results on reply recovery compared to simpler models, although they cannot be applied to thread reconstruction with simple heuristics alone. In addition, our experiments show that model performance depends on the chat domain. We open-source a model that automatically predicts which message a particular reply responds to, and provide a representative Russian dataset that we built from Telegram chats of different domains, along with a test set for the thread reconstruction task.
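To make the reply recovery task concrete: given a reply and the preceding messages, the model must pick the message being answered. The following is a minimal sketch of a trivial lexical-overlap baseline for this task; the chat messages, function names, and scoring are invented for illustration, and the paper's BERT-based model is of course far stronger than surface token matching.

```python
# Hypothetical illustration of the reply recovery task: guess which earlier
# chat message a reply responds to, using simple token overlap as a baseline.

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two token sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def recover_reply(history: list, reply: str) -> int:
    """Return the index of the history message most similar to the reply."""
    reply_tokens = set(reply.lower().split())
    scores = [jaccard(set(m.lower().split()), reply_tokens) for m in history]
    return max(range(len(history)), key=scores.__getitem__)

chat = [
    "Anyone knows a good pizza place downtown?",
    "I just pushed the fix to the repo.",
    "Meeting moved to 3pm tomorrow.",
]
print(recover_reply(chat, "The pizza place on Main St is good"))  # → 0
```

A baseline like this fails exactly where the paper's findings suggest: replies such as "yes, sure" share no vocabulary with their antecedent, which is why contextual models outperform surface heuristics.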
We present a binary classifier that predicts the occurrence of microsyntactic units in sentences. The model
is based on the AWD-LSTM architecture, with an encoder pre-trained on the Russian version of Wikipedia and further
trained on a dataset built from the SynTagRus corpus supplied with microsyntactic markup. We present the structure
of the model and discuss its output. The study shows that binary classification makes microsyntactic markup more
targeted and helps to significantly improve its recall.
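For contrast with the neural classifier, here is a minimal sketch of the naive dictionary-lookup baseline for the same binary task; the unit list is a made-up sample, not the SynTagRus inventory.

```python
# Hypothetical illustration: a dictionary-lookup baseline for the binary task
# "does this sentence contain a microsyntactic unit?". The unit list is a
# made-up sample; the paper's AWD-LSTM model learns contextual cues instead.

MICROSYNTACTIC_UNITS = [
    "тем не менее",     # "nevertheless"
    "по крайней мере",  # "at least"
    "в то время как",   # "while / whereas"
]

def contains_unit(sentence: str) -> bool:
    """Binary label: True if any listed unit occurs in the sentence."""
    s = sentence.lower()
    return any(unit in s for unit in MICROSYNTACTIC_UNITS)

print(contains_unit("Тем не менее, результаты оказались высокими."))  # → True
```

Surface matching like this cannot tell an idiomatic unit from the same words used as a free combination, which is precisely the ambiguity a trained classifier resolves.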
This article presents a corpus of Byzantine accentuated texts (BGAT) developed since 2008. It currently includes 1010 Byzantine inscriptions, 950 papyri from various collections dating from the 1st to the 9th centuries, 132 seals from the Dumbarton Oaks collection, and a selection of 100 Athos manuscripts from the 8th to the 15th centuries. Based on the collected data, we developed a method for marking up such texts, which makes it possible to build a database of accentuated texts from them and, using the entire corpus, to train neural networks to classify texts by accentuation system and to recognize these systems in images. As a result of the markup, in addition to the previously known Alexandrian, Byzantine, and Dorian accentuation systems, new systems were identified, including a logical (semantic) system, systems with a rightward or leftward shift of the accent mark, and mixed systems. For each group of monuments, we identified its variants of the accentuation systems, especially the Alexandrian one, which reveal different aspects of the accentuation of Byzantine Greek. When creating a glossary of accentuated word forms based on the corpus, we found lexemes that retained their accentuation regardless of the influence of dialect, meter, or the traditions characteristic of individual masters. However, a comparison of identical texts, even those found in the same region of the Byzantine Empire, showed that accentuation was not replicated in quotation.
The paper describes a proposed strategy for evaluating controlled text generation with sentiment as the attribute. Our approach combines automatic sentiment analysis (ruBERT) and topic modelling (BERTopic), applied to a parallel corpus of artificially produced and human-written texts. The model under evaluation, ruGPT3Large, is fine-tuned on reviews parsed from a large Russian movie website, with the sentiment given as a prompt. The results of the analysis demonstrate that the proposed methods offer a more comprehensive understanding of the model's advantages and limitations with respect to semantics and sentiment. Additionally, the paper employs metrics such as BERTScore and self-BLEU to further evaluate the generated text. The proposed methodology provides a novel approach to evaluating the quality of generated text and may have implications for future studies in the field.
The report introduces a new resource: the Typological Constructicon database. This resource contains an inventory of constructions from selected semantic fields in a number of languages of different areal and genetic affiliations.
The constructions are labeled according to a number of semantic and morphosyntactic parameters and provided with
a detailed description and illustrative examples.
In task-oriented dialog systems, conversational agents have the means to plan the dialog to accomplish user
tasks (e.g., order pizza). In chit-chat systems, there are no such straightforward tasks. Yet, in chit-chat dialogs
people still pursue goals, but these goals are more abstract and thus less formalizable. In this work, we describe the
development process of two goal-aware prototypes of a chatbot. The first prototype features entirely human-crafted
scenarios for seven topic-specific (low-level) goals and a Goal Tracker service that detects these goals and monitors
progress toward their achievement. The second prototype combines pre-written utterances with response generation by a
DialoGPT model to cover the scenarios of four general (high-level) goals. The results show that introducing the
concept of goals improves the performance of a chit-chat dialog system. A qualitative analysis of conversations with the
High-Level goals prototype demonstrates cases where a goal-aware chatbot outperforms the original one.
This paper, based on data from more than 600 languages collected during work on the database of
semantic shifts in the languages of the world, addresses the semantic transition 'sun'/'day'. We analyze the geographic
and genealogical distribution of this semantic shift, the predominant direction of semantic development, and the
patterns of morphological derivation associated with the shift.
In this paper we describe a generative question answering system which relies on text or knowledge graphs to
find supporting evidence. The goal of generative QA is to provide a natural full sentence answer relying on the
relevant evidence. Unlike existing models, the system proposed here can generate full answers using knowledge
base triplets as evidence and is not restricted to simple questions consisting of one triplet. The generation module
is a pretrained encoder-decoder transformer. Additionally, we constructed a new dataset DSberQuAD to train and
evaluate the generative QA system in Russian. The new dataset was constructed in a rule-based manner and is an
extension of SberQuAD with full-sentence answers for each question. The proposed model sets a new SOTA for
Russian KBQA on the RuBQ2.0 dataset. All the code and data from this project are available on GitHub under the
Apache license.
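A common way to let an encoder-decoder consume KB triplets as evidence is to linearize them into the input text. The sketch below shows one such scheme; the separator tokens and prompt format are invented for illustration, not the format actually used by the system described above.

```python
# Hypothetical sketch: linearizing knowledge-base triplets into a single
# encoder input for a generative QA model. Separators are invented.

def linearize_triplets(question: str, triplets: list) -> str:
    """Build one encoder input string from a question and supporting triplets."""
    evidence = " ; ".join(f"{s} | {p} | {o}" for s, p, o in triplets)
    return f"question: {question} evidence: {evidence}"

src = linearize_triplets(
    "Who wrote War and Peace?",
    [("War and Peace", "author", "Leo Tolstoy")],
)
print(src)
# → question: Who wrote War and Peace? evidence: War and Peace | author | Leo Tolstoy
```

Because the evidence string can hold several `s | p | o` segments joined by `;`, this kind of input is not limited to single-triplet questions, which mirrors the multi-triplet capability the abstract emphasizes.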
This paper presents our work on the development of a morphological analyzer for Siberian Ingrian Finnish.
Siberian Ingrian Finnish is a low-resource language. In this paper, we present an algorithm for analyzing nouns of
Siberian Ingrian Finnish and show an example of analysis.
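The general shape of such a nominal analyzer can be sketched as longest-suffix matching against a case-suffix table. The table below is invented for illustration and does not reflect actual Siberian Ingrian Finnish morphology, which the paper's analyzer encodes properly.

```python
# Hypothetical sketch of suffix-based nominal analysis. The suffix table is
# toy data, NOT real Siberian Ingrian Finnish morphology.

SUFFIXES = {
    "ssa": "INESSIVE",
    "sta": "ELATIVE",
    "lla": "ADESSIVE",
    "n":   "GENITIVE",
}

def analyze_noun(form: str) -> tuple:
    """Return (stem, case) by longest-suffix match; default to NOMINATIVE."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if form.endswith(suffix) and len(form) > len(suffix):
            return form[: -len(suffix)], SUFFIXES[suffix]
    return form, "NOMINATIVE"

print(analyze_noun("talossa"))  # → ('talo', 'INESSIVE')
```

A real analyzer must additionally handle stem alternations and ambiguous suffixes, which is where finite-state approaches usually come in; this sketch only shows the control flow.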
The paper presents the principles of creating a database for research in lexical typology and describes the
possibilities of its use as a linguistic resource. The database is built around semantic fields and frames, i.e., units
of analysis in the frame-based theory of lexical typology.
The database provides a universal format for storing the data; therefore, any project in lexical typology can
be easily added. The database not only stores data from previous research projects but also allows anyone
who wishes to contribute to submit data via its web interface. The database includes examples provided by native
speakers and manually annotated with translations, semantic fields, and frames, following the annotation principles
adopted within the frame approach to lexical typology.
The study addresses optimizing pre-editing of Russian-language texts in order to improve the quality of machine translation into English. A probabilistic assessment of translation-task complexity is proposed as the basis for selecting a pre-editing strategy. A generalized model of the translation process is presented, along with a mathematical model and an algorithm for automated assessment of translation-task complexity. A test of the model on specialized texts from the oil and gas industry is described; it showed that the complexity estimate correlates with an estimate of translation quality and can be used to select a strategy for optimizing pre-editing of source texts in machine translation tasks.
In this article, we present a method for detecting anaphoric proper names in fictional texts using a Word2Vec model and graph community-detection algorithms. This method groups the different namings of a single entity and can be useful when preprocessing texts for further analysis, such as building social networks or training neural models. The method uses a large text collection from a single domain. Its foundation is training a Word2Vec model using information on direct character interactions; this model allows building a social graph of characters. Then, the Louvain algorithm is used to divide the graph into communities containing the different names of characters that share the same denotation.
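The final grouping step can be sketched as clustering a similarity graph of name variants. In this minimal stdlib sketch, connected components over thresholded edges stand in for the Louvain algorithm, and the similarity scores and names are made up; in the actual method the edge weights come from the Word2Vec model trained on character interactions.

```python
# Hypothetical illustration: grouping name variants in a similarity graph.
# Connected components stand in for Louvain; scores and names are invented.

from collections import defaultdict

def group_names(edges: list, threshold: float = 0.5) -> list:
    """Group names connected by similarity edges at or above the threshold."""
    adj, names = defaultdict(set), set()
    for a, b, sim in edges:
        names.update((a, b))
        if sim >= threshold:
            adj[a].add(b)
            adj[b].add(a)
    groups, seen = [], set()
    for name in sorted(names):
        if name in seen:
            continue
        stack, component = [name], set()
        while stack:                      # depth-first traversal
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adj[node])
        seen |= component
        groups.append(sorted(component))
    return groups

edges = [
    ("Pierre", "Bezukhov", 0.9),
    ("Pierre", "Pyotr Kirillovich", 0.7),
    ("Natasha", "Rostova", 0.8),
    ("Pierre", "Natasha", 0.2),   # below threshold: different characters
]
print(group_names(edges))
```

Unlike this sketch, Louvain optimizes modularity and can split a dense component into several communities, which matters when unrelated characters are weakly linked through shared scenes.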