SYSTEM FOR EXTRACTING SEMANTIC INFORMATION FROM NATURAL LANGUAGE TEXT
Igor Petrovich Kuznetsov
The Institute for Informatics Problems of the Russian Academy of Sciences
Andrey Grigorievich Matskevich
The Institute for Informatics Problems of the Russian Academy of Sciences
igor-kuz@mtu-net.ru
Key words: Knowledge extraction, Semantic networks, Analytical processing.
This paper considers a system that extracts significant information (objects with their attributes and links, and groups of objects that compose events) from free natural-language text. The information is represented in a knowledge base (KB) as semantic networks and is processed at the network level. The system uses the KB for analytical text processing and fuzzy search. To discover significant and analytical information in texts, the system uses special semantic filters. Methods of discovery and analytical processing are considered. The system has been applied to logical-analytical tasks in accident report processing. It can be tuned to another application by changing the linguistic knowledge that indicates the significant objects, links and contexts. The system was tuned to Russian texts about commercial banks in order to extract significant information about them and to rank the banks. Another application is connected with databases (DB): the system can read free text and fill the empty fields of a DB.
- Introduction
The system considered in this paper shares many features and tasks with FASTUS [1]. The two systems were designed in parallel and independently during the same period. Like FASTUS, our system is oriented toward users who are interested in semantically significant information that can be expressed in free text in many ways. This information represents objects and relations: for example, persons with their names, surnames, addresses and telephone numbers, as well as organizations, banks and equipment with their qualitative and quantitative data. We assume that every attribute can consist of many words that describe some aspect of an object or event, for example the address of a person or the place of an event. Linked objects compose events and situations.
Unlike FASTUS, our system extracts objects and their attributes from large-scale natural-language texts in order to form a large knowledge base (KB) and to support analytical processing of knowledge at the KB level using concepts, attributes and links. Information in the KB is represented by special semantic networks [2]. To process them, new methods and tools for text mining, logical knowledge analysis and fuzzy search at the KB level are proposed.
The system has been applied in the Moscow criminal police to search for criminals and accidents and to support analytical decisions based on criminal information: accident reports, word portraits of persons and their telephone books. The system divides a report into parts (documents) that describe independent events. For every part the system forms its own semantic networks presenting the significant information. These networks are called the content portraits of events or documents. They are stored in a large-scale database (DB) and are selected during search and analysis. As a result, the operative KB is composed.
The system has a thesaurus to extend the retrieval space and linguistic knowledge for discovering significant information. Both are also represented as semantic networks. All processing is performed at the level of semantic networks by programs written with the special tool DECL. The programs consist of production rules and are oriented toward the transformation of semantic networks.
The system can be tuned to another application by indicating the significant objects and links and by modifying the thesaurus and the linguistic knowledge. We used the system to analyze texts about commercial banks in order to extract significant information about them and to determine the bank rankings. Another wide application is connected with DB: the system can read free text and fill the empty fields of a DB.
We have now designed an English version to demonstrate the system's capabilities.
- Content portraits of documents
Content portraits of documents are semantic networks that represent the significant objects, their attributes and links. Semantic networks consist of elementary fragments, which are N-place predicates marked with their codes. If a predicate corresponds to a relation between objects, then its code corresponds to all of these objects considered as a whole. The codes may occupy argument places of other fragments. A fragment is therefore a broader concept than a predicate in logic. Such semantic networks can represent combined information of various degrees of complexity.
To select significant objects and their attributes from free text, the linguistic processor (LP) is used. First the LP transforms all words to a normal form: for Russian nouns, for example, this is the nominative singular, and for verbs the infinitive.
After that the LP looks for words that indicate the presence of an object or an attribute. For example, the words ADDRESS, LIVE, STREET,... indicate the presence of a person's address. The LP determines the border of an attribute using linguistic knowledge, which specifies the possible words of the attribute and their forms. For example, it may specify a number, or indicate that a word begins with a capital letter.
The objects and attributes are divided into two classes. The first has a fixed number of positions, for example the full name of a person or the date of an accident. The second has an unfixed number of positions, which can be restricted by specifying a maximal number of words, for example an address or a person's features. For this class the linguistic knowledge determines the possible words at the beginning and at the end. Depending on the class, the LP adjusts the border and takes the words inside it. The LP differentiates between obligatory and auxiliary words and takes them into account too. Linguistic knowledge also determines the positions of words and their rearrangement inside attributes, the possible distance between words and so on [3]. This provides text mining that extracts the significant information.
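The border detection for an attribute with an unfixed number of positions can be illustrated by a minimal Python sketch. It is not the LP itself: the rule table ATTRIBUTE_RULES, its stop words and the limit of five words are assumptions made for illustration, and normalization to a normal form is omitted.

```python
import re

# Hypothetical linguistic knowledge: words that open an attribute,
# words that close it, and the maximal number of words it may hold.
ATTRIBUTE_RULES = {
    "ADDRESS": {
        "triggers": {"ADDRESS", "LIVE", "STREET"},
        "stop_words": {"WORK", "PHONE"},
        "max_words": 5,
    },
}

def tokenize(text):
    """Split text into upper-case word and number tokens (no normalization)."""
    return [t.upper() for t in re.findall(r"[A-Za-z]+|\d+", text)]

def extract_attributes(text, rules=ATTRIBUTE_RULES):
    """Return (attribute name, words inside its border) pairs found in text."""
    tokens = tokenize(text)
    found = []
    for name, rule in rules.items():
        for i, token in enumerate(tokens):
            if token not in rule["triggers"]:
                continue
            words = []
            for nxt in tokens[i + 1:]:
                # The border closes on a stop word or at the word limit.
                if nxt in rule["stop_words"] or len(words) == rule["max_words"]:
                    break
                words.append(nxt)
            found.append((name, words))
            break  # this sketch keeps one attribute of each kind per text
    return found

print(extract_attributes("Address: Lenina street 5, phone 1234567"))
```

A fixed-position attribute such as a full name would need a different rule shape (one slot per position), which this sketch does not cover.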
As a result of the analysis, the LP forms the semantic networks that present the content portrait of a document.
Example 1.
Document: "Professor Kuznetsov Igor; he has the height about 175-180, looking 50 years old, works at the Russian Academy of Sciences, designs systems in the field of Artificial Intelligence."
Its content portrait is:
DOC(24,TEXT)
NAME(0+,KUZNETSOV,IGOR,??,1)
POSITION(0-,PROFESSOR/1+)
FEATURE(0-,HEIGHT,175,180,LOOK,50,YEAR,OLD/2+)
WORK_PLACES(0-,IN,RUSSIA,ACADEMY,OF,SCIENCE/3+)
SENTENCE(24,1-,2-,3-,DESIGN,SYSTEM,FIELD,OF,ARTIFICIAL,INTELLIGENCE)
Fragment DOC(24,TEXT) indicates that the document has number 24 in the report. The document consists of one sentence, in which the LP selected a person. Fragment NAME(0+,KUZNETSOV,IGOR,??,1) has fixed positions: 0+ is the inner code of the person in the KB, the sign ?? indicates the undefined second name, and the figure 1 indicates the number of persons. Likewise, the word combination TWO UNKNOWN PERSONS would be represented in the KB by the fragment NAME(10+,??,??,??,2), where 10+ is the system code of the persons. Fragments POSITION(0-,PROFESSOR/1+), FEATURE(0-,.../2+) and WORK_PLACES(0-,.../3+) indicate the attributes of Kuznetsov Igor; 0- is his code used for the second time. The signs 1+, 2+ and 3+ are the codes of these fragments. They are used in the fragment SENTENCE(24,1-,2-,...) to indicate the positions of the attributes in the sentence.
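The fragment notation of Example 1 can be read into a simple data structure. The Python sketch below is only an illustration of the notation, not the KB format of the system: the class Fragment and the functions parse_fragment and attributes_of are hypothetical names, and the sketch assumes that a code followed by "-" refers to the object or fragment introduced with the same code followed by "+".

```python
import re
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Fragment:
    name: str                   # predicate name, e.g. "POSITION"
    args: List[str]             # argument places, e.g. ["0-", "PROFESSOR"]
    code: Optional[str] = None  # fragment's own code after the slash, e.g. "1+"

# NAME(arg,arg,...) with an optional "/<digits>+" code before the closing bracket.
FRAGMENT_RE = re.compile(r"^(\w+)\((.*?)(?:/(\d+\+))?\)$")

def parse_fragment(line):
    """Parse one line of a content portrait into a Fragment."""
    match = FRAGMENT_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a fragment: {line!r}")
    name, body, code = match.groups()
    return Fragment(name, body.split(","), code)

def attributes_of(fragments, object_code):
    """Return the fragments that refer to an object, e.g. '0' matches '0-'."""
    reference = object_code + "-"
    return [f for f in fragments if reference in f.args]

portrait = [
    "NAME(0+,KUZNETSOV,IGOR,??,1)",
    "POSITION(0-,PROFESSOR/1+)",
    "FEATURE(0-,HEIGHT,175,180,LOOK,50,YEAR,OLD/2+)",
    "WORK_PLACES(0-,IN,RUSSIA,ACADEMY,OF,SCIENCE/3+)",
]
fragments = [parse_fragment(line) for line in portrait]
for fragment in attributes_of(fragments, "0"):
    print(fragment.name, fragment.args[1:], fragment.code)
```

Keeping the codes as plain strings makes it easy to follow references such as 1-, 2- and 3- from a SENTENCE fragment back to the attribute fragments they point to.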
- Analytical fragments
When a policeman looks for similar events or accidents, he takes into account many factors indicating the crime action, the way the crime was committed, the mode of penetration and so on. He uses the corresponding classification. This information may be implied, that is, absent from the document in explicit form. To extract it, a method based on semantic filters was proposed. It uses fragments presenting the semantic spaces of words (free synonyms, context-dependent synonyms, words close or contrary in meaning), the SUB-tree presenting various classifications, and fragments playing the role of semantic filters. For example, the fragment
WORD(CLOTHES,COLOR,CLOTHES)
indicates the following: if a word for a color and a word for clothes occupy adjacent places in a sentence, then the word combination describes clothes. The system looks through the semantic spaces of the two words and analyzes the distance between them. Moreover, a fragment can require either a strict order of words or free positions.
In the criminal system these fragments are used to combine words, to select word combinations, to restore implied information and to evaluate the document against the accident classifications.
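The action of a filter such as WORD(CLOTHES,COLOR,CLOTHES) can be sketched in Python as follows. The semantic spaces listed in SEMANTIC_SPACES, the function apply_filter and its parameters max_distance and strict_order are assumptions introduced for illustration; the real system keeps this knowledge in semantic networks.

```python
# Hypothetical semantic spaces: class name -> words belonging to it.
SEMANTIC_SPACES = {
    "COLOR": {"BLACK", "RED", "BLUE"},
    "CLOTHES": {"JACKET", "COAT", "HAT"},
}

def apply_filter(tokens, result, first, second,
                 max_distance=1, strict_order=True, spaces=SEMANTIC_SPACES):
    """Find pairs (word of class `first`, word of class `second`) that lie
    within max_distance positions and label each pair with class `result`."""
    combinations = []
    for i, a in enumerate(tokens):
        if a not in spaces[first]:
            continue
        for j, b in enumerate(tokens):
            if b not in spaces[second]:
                continue
            gap = j - i
            if strict_order and gap <= 0:
                continue  # strict order: the second word must follow the first
            if 0 < abs(gap) <= max_distance:
                combinations.append((result, a, b))
    return combinations

tokens = "MAN IN BLACK JACKET AND RED HAT".split()
# Filter WORD(CLOTHES,COLOR,CLOTHES): COLOR next to CLOTHES describes CLOTHES.
print(apply_filter(tokens, "CLOTHES", "COLOR", "CLOTHES"))
```

Setting strict_order=False would also accept the reverse order of the two words, which corresponds to the free positions mentioned above.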
Example 2.
The 17th document in the report: "Two corpses of Caucasian men were found on the seats of a VAZ-2109 car. The analysis shows that their death was caused by firearm wounds. At the place of the crime the cartridge case of a TT pistol was found."
The analytical fragments:
ANALYTIC(17,"Crime action",WOUND,FIREARM)
ANALYTIC(17,PERSON,NATIONALITY,CAUCASIAN)
ANALYTIC(17,ARMS,PISTOL,TT,CAR,VAZ-2109)
where every word is either a name of a class or specifies the previous word.
A fragment can be transformed by the system into the natural language form:
Crime action: WOUND (FIREARM)
PERSON: NATIONALITY (CAUCASIAN)
ARMS: PISTOL (TT)
CAR: VAZ-2109
Analytical fragments play a significant role in the search for similar persons and events.
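The transformation of an analytical fragment into its natural-language form can be sketched in Python. The sketch assumes that the set of class names is known in advance (CLASS_NAMES below is a hypothetical stand-in for the classifications of the system) and that every word following a class name specifies the word before it.

```python
# Assumed set of class names; in the system they come from the classifications.
CLASS_NAMES = {"Crime action", "PERSON", "ARMS", "CAR"}

def format_group(group):
    """Format one class name with its value and optional specifying words."""
    class_name, *rest = group
    if not rest:
        return class_name
    value, *specifiers = rest
    text = f"{class_name}: {value}"
    if specifiers:
        text += f" ({' '.join(specifiers)})"
    return text

def render(args, class_names=CLASS_NAMES):
    """Render the arguments of an ANALYTIC fragment (without the document
    number) as readable lines, starting a new line at every class name."""
    lines, group = [], []
    for word in args:
        if word in class_names and group:
            lines.append(format_group(group))
            group = []
        group.append(word)
    if group:
        lines.append(format_group(group))
    return lines

for fragment_args in (["Crime action", "WOUND", "FIREARM"],
                      ["PERSON", "NATIONALITY", "CAUCASIAN"],
                      ["ARMS", "PISTOL", "TT", "CAR", "VAZ-2109"]):
    print("\n".join(render(fragment_args)))
```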
- Features of search
The system uses a method of fuzzy search based on the weights of significant attributes and on the variation of words within their semantic spaces [3].
The search for similar objects and events is triggered by a question, which is transformed into a semantic network presenting its content portrait. From it the system extracts the significant words and attributes, which become the signs (indications) for the search. The search consists in checking for the presence of these question signs in documents.
The system derives and takes into account the following signs:
- primary signs (significant question words in normal form);
- secondary signs (synonyms of the primary signs, words with close meaning, explaining words and so on), which are derived from the primary signs by means of the thesaurus;
- analytical signs (for example, crime actions, the way the crime was committed, the mode of penetration and so on) taken from analytical fragments;
- contradictory or alternative signs which are derived by thesaurus.
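The derivation of the four kinds of signs can be sketched in Python. The thesaurus entries in THESAURUS and the function derive_signs are hypothetical; the sketch only shows how secondary and contradictory signs follow from the primary ones.

```python
# Hypothetical thesaurus: word -> synonyms and words contrary in meaning.
THESAURUS = {
    "PISTOL": {"synonyms": {"HANDGUN", "GUN"}, "opposites": set()},
    "TALL": {"synonyms": {"HIGH"}, "opposites": {"SHORT"}},
}

def derive_signs(question_words, analytic_words=(), thesaurus=THESAURUS):
    """Build the search signs of a question from its normalized words."""
    primary = set(question_words)
    secondary, contradictory = set(), set()
    for word in primary:
        entry = thesaurus.get(word, {})
        secondary |= entry.get("synonyms", set())
        contradictory |= entry.get("opposites", set())
    return {
        "primary": primary,
        "secondary": secondary - primary,       # never repeat a primary sign
        "analytical": set(analytic_words),      # taken from analytical fragments
        "contradictory": contradictory - primary,
    }

signs = derive_signs(["TALL", "MAN", "PISTOL"], analytic_words=["FIREARM"])
for kind, words in signs.items():
    print(kind, sorted(words))
```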
A sign may be a word combination selected by the LP, for example "clear eyes", which indicates that CLEAR relates to the word EYES. This decreases the noise in the search.
A question may be expressed as a user's text or as some document in free form. For example, a text about an accident or a word portrait of a person may play the role of the question. The system matches this text with the information in the KB at the content level.
The search is fuzzy because it does not demand the exact coincidence of all signs. The system finds only the common features of a question and the documents, and the degree of their proximity.
The search consists in a detailed analysis of the signs in the content portraits of the question and of the loaded documents. The system tries to match them and to compute the weight of every document more precisely. To this end the system takes into account the following:
- coincidence of the primary and secondary signs, with their weights;
- contradictory signs;
- strong coincidence, when the document has many signs of the question (words, word combinations) that relate to the same attribute and are not far from each other;
- full coincidence, when some attribute of the question and of the document contains the same address, car number or person's data;
- inclusion of a number in an interval, for example height 182 is included in 180-190;
- intersection of intervals;
- closeness of numbers, because the height of a person in the question and in the documents can differ slightly.
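Some of the factors listed above can be combined in a small Python sketch. The weights in WEIGHTS, the tolerance of 3 and the function names are assumptions; the paper does not state the actual values or formula, and strong and full coincidence are left out.

```python
# Assumed weights: contradictory signs lower the weight of a document.
WEIGHTS = {"primary": 1.0, "secondary": 0.5, "contradictory": -1.0}

def sign_score(signs, document_words, weights=WEIGHTS):
    """Sum the weights of the question signs that are present in a document."""
    document = set(document_words)
    return sum(weight * len(signs[kind] & document)
               for kind, weight in weights.items())

def number_match(question, document, tolerance=3):
    """Compare two numeric values given as (low, high) intervals; a single
    number n is written as (n, n). Matches on inclusion, on intersection,
    or when the gap between the intervals does not exceed the tolerance."""
    q_low, q_high = question
    d_low, d_high = document
    if q_low <= d_high and d_low <= q_high:
        return True  # the number is included or the intervals intersect
    gap = max(d_low - q_high, q_low - d_high)
    return gap <= tolerance

signs = {"primary": {"PISTOL", "TALL"},
         "secondary": {"GUN", "HIGH"},
         "contradictory": {"SHORT"}}
print(sign_score(signs, ["GUN", "TALL", "SHORT"]))
print(number_match((182, 182), (180, 190)))
```

A document weight could then add a bonus whenever number_match succeeds for an attribute such as height; how the system actually combines the factors is not specified here.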
A user can control the search with special symbols in the question. For example, the symbol @ after a word means that it is an obligatory sign. The question IVANOV@ IVAN@ will cause a search for documents containing the words IVANOV and IVAN.
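The handling of the obligatory symbol @ can be sketched in Python. The functions parse_question and passes are hypothetical names, and the sketch assumes that a document must contain every obligatory sign to be considered at all.

```python
def parse_question(text):
    """Split a question into obligatory signs (marked with @) and other signs."""
    obligatory, optional = [], []
    for token in text.split():
        if token.endswith("@") and len(token) > 1:
            obligatory.append(token[:-1].upper())
        else:
            optional.append(token.upper())
    return obligatory, optional

def passes(obligatory, document_words):
    """A document is kept only if it contains every obligatory sign."""
    return set(obligatory) <= set(document_words)

obligatory, optional = parse_question("IVANOV@ IVAN@ tall")
print(obligatory, optional)
print(passes(obligatory, ["IVANOV", "IVAN", "PETROVICH"]))
```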
- Analytical tasks
When documents are presented as semantic networks and loaded into the KB, the system can solve various analytical tasks. For example, the criminal system can find the links between persons and select organized groups. Links are found in the following way. Two persons may be linked if they appear in one accident (document), or if they took part in different accidents in which the same telephone number, address or similar data were found. On the basis of the persons' links the system forms a graph, which is displayed to the user. Users can look through all persons and their links and can enter and retrieve information about them in a convenient form. The user can pass from one person to another and analyze their link, which can be direct or indirect (a link through some other person).
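The construction of the link graph and the search for an indirect link can be sketched in Python. The document format (a dictionary with "persons" and "contacts") and the function names are assumptions for illustration; breadth-first search is used here as one simple way to find the shortest chain of links, not necessarily the method of the system.

```python
from collections import defaultdict, deque
from itertools import combinations

def build_links(documents):
    """Link persons who appear in one document or share a telephone/address."""
    graph = defaultdict(set)
    persons_by_contact = defaultdict(set)
    for document in documents:
        for a, b in combinations(sorted(document["persons"]), 2):
            graph[a].add(b)
            graph[b].add(a)
        for contact in document["contacts"]:
            persons_by_contact[contact].update(document["persons"])
    for persons in persons_by_contact.values():
        for a, b in combinations(sorted(persons), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph

def find_chain(graph, start, goal):
    """Breadth-first search for the shortest chain of links between persons."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in sorted(graph.get(path[-1], ())):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None  # the persons are not linked

documents = [
    {"persons": {"IVANOV", "PETROV"}, "contacts": set()},
    {"persons": {"PETROV"}, "contacts": {"123-45-67"}},
    {"persons": {"SIDOROV"}, "contacts": {"123-45-67"}},
]
graph = build_links(documents)
print(find_chain(graph, "IVANOV", "SIDOROV"))
```

In this illustration IVANOV and SIDOROV never appear in one document, so their link is indirect and passes through PETROV, who shares a telephone number with SIDOROV.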
The system can also be applied to search for sentences that are nearest in meaning and for the documents in which they are used. In this case the system can be tuned to divide documents into small parts, namely sentences. This is a significant task for many applications.
Other analytical tasks are connected with object identification, ranking of objects, their comparison and so on. To solve these and other tasks, programs in the language DECL were designed. DECL is oriented toward structure processing and inference. Our practice shows that a user can design analytical programs in DECL in a shorter time than with other tools.
References
[1] FASTUS: a Cascaded Finite-State Transducer for Extracting Information from Natural-Language Text. // AIC, SRI International. Menlo Park, California, 1996.
[2] Kuznetsov Igor. Semantic Representations. Moscow: Science, 1978. 294 p. (in Russian).
[3] Kuznetsov Igor. Methods of report processing which reveal the characteristics of figurants and incidents. International workshop // "Dialogue'98": Computational Linguistics and its applications. Vol. 2. Kazan, 1998. P. 961-700.
System for extracting semantic information from natural-language text
I. P. Kuznetsov, A. G. Matskevich
Key words: knowledge extraction, semantic networks, analytical processing.
The paper describes a modern system for extracting significant content information (objects with attributes and links, and groups of objects that compose events) from free-form natural-language text. This information is represented in the knowledge base as semantic networks and is processed at the network level. The system uses the knowledge base (KB) for analytical processing and fuzzy search. To discover significant and analytical information in texts, the system uses special semantic filters. The methods used for discovering information and for its analytical processing are considered. The system has been applied to logical-analytical tasks of accident report processing. The system can be tuned to other applications by means of linguistic knowledge that indicates the significant objects, links and contexts. The system was tuned to Russian-language texts on banking in order to extract significant information and to rank banks. One more application is connected with databases. The system can fill the fields of a database with information extracted from free natural-language text.