Until recently, researchers have considered only cross-domain flat NER. In this work, we propose an embarrassingly simple yet effective approach to the double challenge of cross-domain nested NER. We use a RuBERT-base encoder and a Biaffine decoder with a CNN block as the backbone nested NER model. The cross-domain recipe itself is simple: keep only the categories shared by the source and target domain datasets, train the model on the source domain, and apply it to the target domain. The results show that the proposed method suffers a drop in performance compared to the usual training approach but, unlike the latter, requires no fine-tuning on target-domain data.
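A minimal sketch of the category-intersection step described above is given below; the data format, label names, and training helpers are illustrative assumptions, not the authors' code.

# Keep only entity categories shared by the source and target domains,
# then train on source data and apply the model to the target domain as-is.

SOURCE_LABELS = {"PER", "ORG", "LOC", "DATE"}   # hypothetical source-domain categories
TARGET_LABELS = {"PER", "ORG", "LOC", "MONEY"}  # hypothetical target-domain categories

COMMON = SOURCE_LABELS & TARGET_LABELS          # shared categories only


def filter_entities(sentences):
    """Drop every entity span whose category is not shared by both domains."""
    filtered = []
    for tokens, entities in sentences:
        kept = [(start, end, label) for start, end, label in entities
                if label in COMMON]
        filtered.append((tokens, kept))
    return filtered


# Usage: train the nested-NER backbone on the filtered source data only, then
# evaluate on the target domain without any fine-tuning. train_model and
# evaluate_model stand in for the RuBERT + Biaffine (+CNN) pipeline.
# source_train, target_test = load_source(), load_target()   # hypothetical loaders
# model = train_model(filter_entities(source_train))
# scores = evaluate_model(model, filter_entities(target_test))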
Meeting minutes are short texts summarizing the most important outcomes of a meeting. The goal of this work is to develop a module for the automatic generation of meeting minutes from a meeting transcript produced by an Automatic Speech Recognition (ASR) system. We treat minuting as a supervised machine learning task on pairs of texts: the transcript of a meeting and its minutes. No Russian minuting dataset was previously available. To fill this gap, we present DumSum, a dataset of meeting transcripts of the Russian State Duma and City Dumas, complete with minutes. We use a two-stage minuting pipeline and introduce semantic segmentation, which improves ROUGE and BERTScore metrics of minutes on City Duma meetings by 1-10% compared to naive segmentation.
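A minimal sketch of such a two-stage pipeline follows; the segmentation rule, the encoder model, the threshold, and the summarizer are illustrative assumptions rather than the exact DumSum setup.

# Two-stage minuting: (1) split the ASR transcript into topical segments,
# (2) summarize each segment and concatenate the summaries into minutes.

from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")


def semantic_segments(sentences, threshold=0.5):
    """Start a new segment whenever adjacent sentences drift apart semantically."""
    embeddings = encoder.encode(sentences, convert_to_tensor=True)
    segments, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if util.cos_sim(embeddings[i - 1], embeddings[i]).item() < threshold:
            segments.append(current)
            current = []
        current.append(sentences[i])
    segments.append(current)
    return segments


def make_minutes(sentences, summarize):
    """summarize() is any abstractive summarizer, e.g. a fine-tuned seq2seq model."""
    return "\n".join(summarize(" ".join(seg)) for seg in semantic_segments(sentences))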
This paper focuses on the microsyntactic construction “Vsem X-am X”, formed by an adjectival pronoun and a lexical repetition of a noun. A corpus-based study showed that linguistic processors fail to separate such phrasemes from occasional sequences that only resemble the sought construction. Its idiomatic nature is another source of errors in linguistic tools, e.g. machine translation systems. A possible solution is to apply a rule-based approach that provides a clear description of the phraseme.
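As an illustration of what such a rule-based description could look like, the sketch below matches a three-token window against the construction; the specific rule and the use of pymorphy2 are assumptions for illustration, not the author's tool.

# Rule: the pronoun "всем" + a noun in dative plural + the same noun repeated.

import pymorphy2

morph = pymorphy2.MorphAnalyzer()


def is_vsem_xam_x(tokens):
    """Match 'всем' + dative-plural noun + a repetition of the same noun lemma."""
    if len(tokens) != 3 or tokens[0].lower() != "всем":
        return False
    dat = morph.parse(tokens[1])[0]
    rep = morph.parse(tokens[2])[0]
    return ("NOUN" in dat.tag and "datv" in dat.tag and "plur" in dat.tag
            and "NOUN" in rep.tag and dat.normal_form == rep.normal_form)


# "всем пирогам пирог" matches; "всем ученикам тетрадь" only resembles it.
print(is_vsem_xam_x(["всем", "пирогам", "пирог"]))      # True
print(is_vsem_xam_x(["всем", "ученикам", "тетрадь"]))   # False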
The article presents the preliminary results of a comparison of the syntactic structure of monologue stories by 12 informants, 6 typical extroverts and 6 deep introverts, on the topic "How do you spend your free time?". The monologues are examined with respect to the prevalence of predicative units of particular types and the variety of syntactic links within them. The paper reports an analysis of the syntactic features of these spontaneous monologue stories in correlation with the speaker's psychotype, based on the "sentences" extracted during a punctuation experiment and the syntactic relations within those "sentences".
Detailed descriptions of Shughni phonotactics are scarce. This is not surprising: not only does Shughni (an Iranian language spoken in the Pamir Mountains) have a relatively small number of speakers (ca. 100,000), but until recently no databases facilitated phonological research on it. Now that pamiri.online, a website on Pamir languages, has been developed, its data can be used to study the sound patterns of Shughni. This paper illustrates how pamiri.online can be employed to update and enhance the description of Shughni phonotactics.
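As an illustration of the kind of phonotactic query such data enables, the sketch below counts word-initial consonant clusters in a word list; the word-list file and the simplified vowel inventory are assumptions, not the actual pamiri.online data model.

# Count word-initial consonant clusters (length >= 2) in a Shughni word list.

from collections import Counter

VOWELS = set("aāeêiīoōuūů")  # an assumed, simplified vowel inventory


def initial_cluster(word):
    """Return the consonant sequence before the first vowel, if any."""
    cluster = []
    for ch in word.lower():
        if ch in VOWELS:
            break
        cluster.append(ch)
    return "".join(cluster)


def cluster_frequencies(words):
    """Frequency table of word-initial clusters of length >= 2."""
    return Counter(c for w in words if len(c := initial_cluster(w)) >= 2)


# Usage (assuming a plain-text word list exported from the database):
# words = [line.strip() for line in open("shughni_wordlist.txt", encoding="utf-8")]
# print(cluster_frequencies(words).most_common(10))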
This study analyzes the work of the UDPipe and Stanza neural parsers with models trained on corpora of the Universal Dependencies (UD) project, whose goal is to develop a universal morphological annotation scheme for corpora of various languages. The paper considers the applicability of the universal tag set to the part-of-speech system of Korean and compares the UD, KAIST, and Sejong tagsets used by the parsers for Korean. The part-of-speech classification proposed in these tagsets is analyzed from the point of view of Korean grammar. The paper also evaluates the performance of the UDPipe and Stanza parsers.
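As an illustration of such a comparison, tagging the same Korean sentence with Stanza models trained on different UD treebanks exposes both the universal (UPOS) and treebank-specific (XPOS) tags; the package names below assume the standard Stanza Korean models and are not taken from the paper.

# Tag one Korean sentence with the KAIST-based UD model and print UPOS/XPOS pairs.

import stanza

stanza.download("ko", package="kaist")
nlp = stanza.Pipeline("ko", package="kaist", processors="tokenize,pos")

doc = nlp("나는 어제 서울에서 친구를 만났다.")
for word in doc.sentences[0].words:
    print(f"{word.text}\t{word.upos}\t{word.xpos}")

Re-running the same snippet with package="gsd" yields the tags of the other UD Korean treebank, which makes the differences between the tagsets directly visible.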