This work is an experimental study of split-scrambling constructions. Using Likert-scale acceptability judgments, self-paced reading, and a prosodic experiment, we compare the separation of a left-branch element (a determiner or possessor) from the noun head within a DP or PP with the separation of the nominal head from its complement (a dependent infinitive or prepositional phrase). The results show that, for Russian speakers, separating the head is not only possible but is rated higher than separating the left-branch element from the head. This pattern is explained by information-structure requirements: a fronted left-branch element cannot form the sole topic of the clause. The low ratings are consistent with existing experimental studies; however, the reading-time results appear inconsistent with existing views about the cognitive load required to process split sentences.
This paper presents a pilot study intended to enhance the semantic and conceptual description of Ancient Greek
verbs in WordNet with information from two other resources, VerbNet and FrameNet, and to enrich a treebank of
Ancient Greek texts with semantic information extracted from the three resources. We provided semantic annotation
for verbs based on their morphosyntactic behavior, and performed a number of queries to extract occurrences from the Ancient Greek treebank intended to match the different meanings of each verb. A manual check of the
data extracted shows that, in spite of a limited number of mismatches, our queries yielded reliable results. The queries
can be further refined in the future and complemented with a rule-based algorithm to map frame elements to
dependency structure.
This study analyzes the reactions of the Italian Twitter community to an environmental demonstration that occurred in Rome
on January 2nd, 2023. We compiled a corpus of 368,531 tokens consisting of 11,780 tweets, collected during a 7-day period.
We propose a mixed-method approach that combines automated and manual corpus analyses of sentiment, emotions, and
implicit language. Our findings offer insights into how tweets reflected the users’ attitudes toward a variety of subjects and
entities. Although the sentiment of the overall debate was distributed rather evenly, the incident itself seems to have sparked
negative sentiment and emotions among Twitter users. The results of our manual analyses revealed some issues with respect
to the automatic classification of sentiment, as some tweets contained irony, sarcasm, and slurs. Such non-literal
interpretations were missed by the tools at hand, which could not account for complex rhetorical-argumentative strategies.
We study the performance of BERT-like distributional semantic language models on anaphora resolution and related
tasks with the purpose of selecting a model for on-device inference. We have found that lean (narrow and deep)
language models provide the best balance of speed and quality for word-level tasks, and we open-source the RuLUKE-tiny
and RuLUKE-slim models we have trained. Both are significantly (over 27%) faster than models of comparable
accuracy. We hypothesise that model depth may play a critical role in performance since, according to recent
findings, each layer behaves as a gradient-descent step in an autoregressive setting.
Diachronicon: a new resource for the study of Russian constructions in a microdiachronic perspective
The article describes the linguistic design of the "Diachronicon" database: the features of the diachronic markup of Russian constructions, as well as tags specially designed for searching the diachronic database. The use of a dedicated comment field is also motivated. In addition, the computer interface of "Diachronicon" is presented and described.
The resource provides extensive opportunities for the systematic study not only of specific constructions, but also of general mechanisms of idiomatization and grammaticalization. The database allows the researcher to compare several separate lines of development simultaneously, to search a list of constructions and their characteristics in diachrony, and to track the history of syntactic and semantic changes and the evolving combinatorial restrictions of different constructions.
The study highlights the asynchronous nature of modern group chats and the problems this creates, such as retrieving information relevant to a given question and understanding reply-to relationships. In this work, we formalize the reply recovery task as a building block toward solving these problems. Using simple heuristics, we then apply the resulting reply-recovery model to the thread reconstruction problem. We show that modern pre-trained models such as BERT achieve strong results on reply recovery compared to simpler models, although they cannot be applied to thread reconstruction with simple heuristics alone. In addition, our experiments show that model performance depends on the chat domain. We open-source a model that automatically predicts which message a particular reply responds to, and provide a representative Russian dataset that we built from Telegram chats of different domains, along with a test set for the thread reconstruction task.
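To make the reply recovery task concrete: given a reply and the preceding messages, the model must pick the message being answered. The following is a minimal sketch of a trivial lexical-overlap baseline for this task; the chat messages, function names, and scoring are invented for illustration, and the paper's BERT-based model is of course far stronger than surface token matching.

```python
# Hypothetical illustration of the reply recovery task: guess which earlier
# chat message a reply responds to, using simple token overlap as a baseline.

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two token sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def recover_reply(history: list, reply: str) -> int:
    """Return the index of the history message most similar to the reply."""
    reply_tokens = set(reply.lower().split())
    scores = [jaccard(set(m.lower().split()), reply_tokens) for m in history]
    return max(range(len(history)), key=scores.__getitem__)

chat = [
    "Anyone knows a good pizza place downtown?",
    "I just pushed the fix to the repo.",
    "Meeting moved to 3pm tomorrow.",
]
print(recover_reply(chat, "The pizza place on Main St is good"))  # → 0
```

A baseline like this fails exactly where the paper's findings suggest: replies such as "yes, sure" share no vocabulary with their antecedent, which is why contextual models outperform surface heuristics.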
We present a binary classifier that predicts the occurrence of microsyntactic units in sentences. The model
is based on the AWD-LSTM architecture, with an encoder pre-trained on the Russian version of Wikipedia and further
trained on a dataset built from the SynTagRus corpus supplied with microsyntactic markup. We present the structure
of the model and discuss its output. The study shows that binary classification makes microsyntactic markup more
targeted and helps to significantly improve its recall.
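For contrast with the neural classifier, here is a minimal sketch of the naive dictionary-lookup baseline for the same binary task; the unit list is a made-up sample, not the SynTagRus inventory.

```python
# Hypothetical illustration: a dictionary-lookup baseline for the binary task
# "does this sentence contain a microsyntactic unit?". The unit list is a
# made-up sample; the paper's AWD-LSTM model learns contextual cues instead.

MICROSYNTACTIC_UNITS = [
    "тем не менее",     # "nevertheless"
    "по крайней мере",  # "at least"
    "в то время как",   # "while / whereas"
]

def contains_unit(sentence: str) -> bool:
    """Binary label: True if any listed unit occurs in the sentence."""
    s = sentence.lower()
    return any(unit in s for unit in MICROSYNTACTIC_UNITS)

print(contains_unit("Тем не менее, результаты оказались высокими."))  # → True
```

Surface matching like this cannot tell an idiomatic unit from the same words used as a free combination, which is precisely the ambiguity a trained classifier resolves.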
This article presents a corpus of Byzantine accentuated texts (BGAT) developed since 2008. It currently includes 1010 Byzantine inscriptions, 950 papyri from various collections dating from the 1st to the 9th centuries, 132 seals from the Dumbarton Oaks collection, and a selection of 100 Athos manuscripts from the 8th to the 15th centuries. Based on the collected data, we developed a method for marking up such texts, which makes it possible to build a database of accentuated texts from them and, using the entire corpus, to train neural networks to classify texts by accentuation system and to recognize these systems in images. As a result of the markup, in addition to the previously known Alexandrian, Byzantine, and Dorian accentuation systems, new systems were identified, including a logical (semantic) system, systems with a rightward or leftward shift of the accent mark, and mixed systems. For each group of monuments, we identified its variants of the accentuation systems, especially the Alexandrian one, which reveal different aspects of the accentuation of Byzantine Greek. When creating a glossary of accentuated word forms based on the corpus, we found lexemes that retained their accentuation regardless of the influence of dialect, meter, or the traditions characteristic of individual masters. However, a comparison of identical texts, even those found in the same region of the Byzantine Empire, showed that accentuation was not replicated in quotation.
The paper describes a proposed strategy for evaluating controlled text generation with sentiment as the attribute. Our approach combines automatic sentiment analysis (ruBERT) and topic modelling (BERTopic), applied to a parallel corpus of artificially produced and human-written texts. The model under evaluation, ruGPT3Large, is fine-tuned on reviews parsed from a large Russian movie website, with the sentiment given as a prompt. The results of the analysis demonstrate that the proposed methods offer a more comprehensive understanding of the model's advantages and limitations with respect to semantics and sentiment. Additionally, the paper employs metrics such as BERTScore and self-BLEU to further evaluate the generated text. The proposed methodology provides a novel approach to evaluating the quality of generated text and may have implications for future studies in the field.
The report introduces a new resource: the Typological Constructicon database. This resource contains an inventory of constructions from selected semantic fields in a number of languages of different areal and genetic affiliations.
The constructions are labeled according to a number of semantic and morphosyntactic parameters and provided with
a detailed description and illustrative examples.
In task-oriented dialog systems, conversational agents have the means to plan the dialog to accomplish user
tasks (e.g., order pizza). In chit-chat systems, there are no such straightforward tasks. Yet, in chit-chat dialogs
people still pursue goals, but these goals are more abstract and thus less formalizable. In this work, we describe the
development process of two goal-aware prototypes of a chatbot. The first prototype features entirely human-crafted
scenarios for seven topic-specific (low-level) goals and a Goal Tracker service that detects these goals and monitors
progress toward their achievement. The second prototype combines pre-written utterances with response generation by a
DialoGPT model to cover the scenarios of four general (high-level) goals. The results show that introducing the
concept of goals improves the performance of a chit-chat dialog system. A qualitative analysis of conversations with the
High-Level goals prototype demonstrates cases where a goal-aware chatbot outperforms the original one.
This paper, based on data from more than 600 languages collected during work on the database of
semantic shifts in the languages of the world, addresses the semantic transition 'sun'/'day'. We analyze the geographic
and genealogical distribution of this semantic shift, the predominant direction of semantic development, and the
patterns of morphological derivation associated with the shift.
In this paper we describe a generative question answering system which relies on text or knowledge graphs to
find supporting evidence. The goal of generative QA is to provide a natural full sentence answer relying on the
relevant evidence. Unlike existing models, the system proposed here can generate full answers using knowledge
base triplets as evidence and is not restricted to simple questions consisting of one triplet. The generation module
is a pretrained encoder-decoder transformer. Additionally, we constructed a new dataset DSberQuAD to train and
evaluate the generative QA system in Russian. The new dataset was constructed in a rule-based manner and is an
extension of SberQuAD with full-sentence answers for each question. The proposed model sets a new SOTA for
Russian KBQA on the RuBQ2.0 dataset. All the code and data from this project are available on GitHub under the
Apache license.
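A common way to let an encoder-decoder consume KB triplets as evidence is to linearize them into the input text. The sketch below shows one such scheme; the separator tokens and prompt format are invented for illustration, not the format actually used by the system described above.

```python
# Hypothetical sketch: linearizing knowledge-base triplets into a single
# encoder input for a generative QA model. Separators are invented.

def linearize_triplets(question: str, triplets: list) -> str:
    """Build one encoder input string from a question and supporting triplets."""
    evidence = " ; ".join(f"{s} | {p} | {o}" for s, p, o in triplets)
    return f"question: {question} evidence: {evidence}"

src = linearize_triplets(
    "Who wrote War and Peace?",
    [("War and Peace", "author", "Leo Tolstoy")],
)
print(src)
# → question: Who wrote War and Peace? evidence: War and Peace | author | Leo Tolstoy
```

Because the evidence string can hold several `s | p | o` segments joined by `;`, this kind of input is not limited to single-triplet questions, which mirrors the multi-triplet capability the abstract emphasizes.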
This paper presents our work on the development of a morphological analyzer for Siberian Ingrian Finnish.
Siberian Ingrian Finnish is a low-resource language. In this paper, we present an algorithm for analyzing nouns of
Siberian Ingrian Finnish and show an example of analysis.
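The general shape of such a nominal analyzer can be sketched as longest-suffix matching against a case-suffix table. The table below is invented for illustration and does not reflect actual Siberian Ingrian Finnish morphology, which the paper's analyzer encodes properly.

```python
# Hypothetical sketch of suffix-based nominal analysis. The suffix table is
# toy data, NOT real Siberian Ingrian Finnish morphology.

SUFFIXES = {
    "ssa": "INESSIVE",
    "sta": "ELATIVE",
    "lla": "ADESSIVE",
    "n":   "GENITIVE",
}

def analyze_noun(form: str) -> tuple:
    """Return (stem, case) by longest-suffix match; default to NOMINATIVE."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if form.endswith(suffix) and len(form) > len(suffix):
            return form[: -len(suffix)], SUFFIXES[suffix]
    return form, "NOMINATIVE"

print(analyze_noun("talossa"))  # → ('talo', 'INESSIVE')
```

A real analyzer must additionally handle stem alternations and ambiguous suffixes, which is where finite-state approaches usually come in; this sketch only shows the control flow.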
The paper presents the principles of creating a database for research in lexical typology and describes the
possibilities of its use as a linguistic resource. The database is built around semantic fields and frames, i.e., units
of analysis in the frame-based theory of lexical typology.
The database provides a universal format for storing the data; therefore, any project in lexical typology can
be easily added. The database not only stores data from previous research projects but also allows anyone
who wishes to contribute to submit data via its web interface. The database includes examples provided by native
speakers and manually annotated with translations, semantic fields, and frames, following the annotation principles
adopted within the frame approach to lexical typology.
The study addresses optimizing pre-editing of Russian-language texts in order to improve the quality of machine translation into English. A probabilistic assessment of translation-task complexity is proposed as the basis for selecting a pre-editing strategy. A generalized model of the translation process is presented, along with a mathematical model and an algorithm for automated assessment of translation-task complexity. A test of the model on specialized texts from the oil and gas industry is described; it showed that the complexity estimate correlates with an estimate of translation quality and can be used to select a strategy for optimizing pre-editing of source texts in machine translation tasks.
In this article, we present a method for detecting anaphoric proper names in fictional texts using a Word2Vec model and graph community-detection algorithms. This method groups the different namings of a single entity and can be useful when preprocessing texts for further analysis, such as building social networks or training neural models. The method uses a large text collection from a single domain. Its foundation is training a Word2Vec model using information on direct character interactions; this model allows building a social graph of characters. Then, the Louvain algorithm is used to divide the graph into communities containing the different names of characters that share the same denotation.
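The final grouping step can be sketched as clustering a similarity graph of name variants. In this minimal stdlib sketch, connected components over thresholded edges stand in for the Louvain algorithm, and the similarity scores and names are made up; in the actual method the edge weights come from the Word2Vec model trained on character interactions.

```python
# Hypothetical illustration: grouping name variants in a similarity graph.
# Connected components stand in for Louvain; scores and names are invented.

from collections import defaultdict

def group_names(edges: list, threshold: float = 0.5) -> list:
    """Group names connected by similarity edges at or above the threshold."""
    adj, names = defaultdict(set), set()
    for a, b, sim in edges:
        names.update((a, b))
        if sim >= threshold:
            adj[a].add(b)
            adj[b].add(a)
    groups, seen = [], set()
    for name in sorted(names):
        if name in seen:
            continue
        stack, component = [name], set()
        while stack:                      # depth-first traversal
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adj[node])
        seen |= component
        groups.append(sorted(component))
    return groups

edges = [
    ("Pierre", "Bezukhov", 0.9),
    ("Pierre", "Pyotr Kirillovich", 0.7),
    ("Natasha", "Rostova", 0.8),
    ("Pierre", "Natasha", 0.2),   # below threshold: different characters
]
print(group_names(edges))
```

Unlike this sketch, Louvain optimizes modularity and can split a dense component into several communities, which matters when unrelated characters are weakly linked through shared scenes.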