Until recently, researchers have considered only cross-domain flat NER. In this work, we propose an embarrassingly simple yet effective approach to the double challenge of cross-domain nested NER. We use a RuBERT-base encoder and a Biaffine decoder with a CNN block as the backbone nested NER model. The cross-domain recipe itself is simple: keep only the categories shared by the source and target domain datasets, train the model on the source domain, and apply it to the target domain. The results show that the proposed method suffers a drop in performance compared to the usual training approach but, unlike the latter, requires no fine-tuning on target-domain data.
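A minimal sketch of the category-intersection step described above is given below; the data format, label names, and training helpers are illustrative assumptions, not the authors' code.

# Keep only entity categories shared by the source and target domains,
# then train on source data and apply the model to the target domain as-is.

SOURCE_LABELS = {"PER", "ORG", "LOC", "DATE"}   # hypothetical source-domain categories
TARGET_LABELS = {"PER", "ORG", "LOC", "MONEY"}  # hypothetical target-domain categories

COMMON = SOURCE_LABELS & TARGET_LABELS          # shared categories only


def filter_entities(sentences):
    """Drop every entity span whose category is not shared by both domains."""
    filtered = []
    for tokens, entities in sentences:
        kept = [(start, end, label) for start, end, label in entities
                if label in COMMON]
        filtered.append((tokens, kept))
    return filtered


# Usage: train the nested-NER backbone on the filtered source data only, then
# evaluate on the target domain without any fine-tuning. train_model and
# evaluate_model stand in for the RuBERT + Biaffine (+CNN) pipeline.
# source_train, target_test = load_source(), load_target()   # hypothetical loaders
# model = train_model(filter_entities(source_train))
# scores = evaluate_model(model, filter_entities(target_test))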
Meeting minutes are short texts summarizing the most important outcomes of a meeting. The goal of this work is to develop a module for the automatic generation of meeting minutes from a meeting transcript produced by an Automatic Speech Recognition (ASR) system. We treat minuting as a supervised machine learning task on pairs of texts: the transcript of a meeting and its minutes. No Russian minuting dataset was previously available. To fill this gap, we present DumSum, a dataset of meeting transcripts of the Russian State Duma and City Dumas, complete with minutes. We use a two-stage minuting pipeline and introduce semantic segmentation, which improves ROUGE and BERTScore metrics of minutes on City Duma meetings by 1-10% compared to naive segmentation.
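A minimal sketch of such a two-stage pipeline follows; the segmentation rule, the encoder model, the threshold, and the summarizer are illustrative assumptions rather than the exact DumSum setup.

# Two-stage minuting: (1) split the ASR transcript into topical segments,
# (2) summarize each segment and concatenate the summaries into minutes.

from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")


def semantic_segments(sentences, threshold=0.5):
    """Start a new segment whenever adjacent sentences drift apart semantically."""
    embeddings = encoder.encode(sentences, convert_to_tensor=True)
    segments, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if util.cos_sim(embeddings[i - 1], embeddings[i]).item() < threshold:
            segments.append(current)
            current = []
        current.append(sentences[i])
    segments.append(current)
    return segments


def make_minutes(sentences, summarize):
    """summarize() is any abstractive summarizer, e.g. a fine-tuned seq2seq model."""
    return "\n".join(summarize(" ".join(seg)) for seg in semantic_segments(sentences))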
This paper focuses on the microsyntactic construction “Vsem X-am X”, formed by an adjectival pronoun and a lexical repetition of a noun. A corpus-based study showed that linguistic processors fail to separate such phrasemes from occasional sequences that only resemble the sought construction. Its idiomatic nature is another source of errors in linguistic tools, e.g. machine translation systems. A possible solution is to apply a rule-based approach that provides a clear description of the phraseme.
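As an illustration of what such a rule-based description could look like, the sketch below matches a three-token window against the construction; the specific rule and the use of pymorphy2 are assumptions for illustration, not the author's tool.

# Rule: the pronoun "всем" + a noun in dative plural + the same noun repeated.

import pymorphy2

morph = pymorphy2.MorphAnalyzer()


def is_vsem_xam_x(tokens):
    """Match 'всем' + dative-plural noun + a repetition of the same noun lemma."""
    if len(tokens) != 3 or tokens[0].lower() != "всем":
        return False
    dat = morph.parse(tokens[1])[0]
    rep = morph.parse(tokens[2])[0]
    return ("NOUN" in dat.tag and "datv" in dat.tag and "plur" in dat.tag
            and "NOUN" in rep.tag and dat.normal_form == rep.normal_form)


# "всем пирогам пирог" matches; "всем ученикам тетрадь" only resembles it.
print(is_vsem_xam_x(["всем", "пирогам", "пирог"]))      # True
print(is_vsem_xam_x(["всем", "ученикам", "тетрадь"]))   # False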
The article presents the preliminary results of a comparison of the syntactic structure of monologue stories by 12 informants, 6 typical extroverts and 6 deep introverts, on the topic "How do you spend your free time?". The monologues are examined with respect to the prevalence of predicative units of particular types and the variety of syntactic links within them. The paper reports an analysis of the syntactic features of these spontaneous monologue stories in correlation with the speaker's psychotype, based on the "sentences" extracted during a punctuation experiment and the syntactic relations within those "sentences".
Detailed descriptions of Shughni phonotactics are scarce. This is not surprising: not only does Shughni (an Iranian language spoken in the Pamir Mountains) have a relatively small number of speakers (ca. 100,000), but until recently no databases facilitated phonological research on it. Now that pamiri.online, a website on Pamir languages, has been developed, its data can be used to study the sound patterns of Shughni. This paper illustrates how pamiri.online can be employed to update and enhance the description of Shughni phonotactics.
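As an illustration of the kind of phonotactic query such data enables, the sketch below counts word-initial consonant clusters in a word list; the word-list file and the simplified vowel inventory are assumptions, not the actual pamiri.online data model.

# Count word-initial consonant clusters (length >= 2) in a Shughni word list.

from collections import Counter

VOWELS = set("aāeêiīoōuūů")  # an assumed, simplified vowel inventory


def initial_cluster(word):
    """Return the consonant sequence before the first vowel, if any."""
    cluster = []
    for ch in word.lower():
        if ch in VOWELS:
            break
        cluster.append(ch)
    return "".join(cluster)


def cluster_frequencies(words):
    """Frequency table of word-initial clusters of length >= 2."""
    return Counter(c for w in words if len(c := initial_cluster(w)) >= 2)


# Usage (assuming a plain-text word list exported from the database):
# words = [line.strip() for line in open("shughni_wordlist.txt", encoding="utf-8")]
# print(cluster_frequencies(words).most_common(10))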
This study analyzes the work of the UDPipe and Stanza neural parsers with models trained on corpora of the Universal Dependencies (UD) project, whose goal is to develop a universal morphological annotation scheme for corpora of various languages. The paper considers the applicability of the universal tag set to the part-of-speech system of Korean and compares the UD, KAIST, and Sejong tagsets used by the parsers for Korean. The part-of-speech classification proposed in these tagsets is analyzed from the point of view of Korean grammar. The paper also evaluates the performance of the UDPipe and Stanza parsers.
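As an illustration of such a comparison, tagging the same Korean sentence with Stanza models trained on different UD treebanks exposes both the universal (UPOS) and treebank-specific (XPOS) tags; the package names below assume the standard Stanza Korean models and are not taken from the paper.

# Tag one Korean sentence with the KAIST-based UD model and print UPOS/XPOS pairs.

import stanza

stanza.download("ko", package="kaist")
nlp = stanza.Pipeline("ko", package="kaist", processors="tokenize,pos")

doc = nlp("나는 어제 서울에서 친구를 만났다.")
for word in doc.sentences[0].words:
    print(f"{word.text}\t{word.upos}\t{word.xpos}")

Re-running the same snippet with package="gsd" yields the tags of the other UD Korean treebank, which makes the differences between the tagsets directly visible.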