Singularika builds comprehensive information extraction solutions and also takes on individual tasks in this field.
What is information extraction?
Information extraction is the automated extraction of structured information from unstructured or semi-structured data. It involves identifying objects, their relations, and their characteristics in texts. As a rule, the task is to analyze a set of natural-language documents, extract the required information, structure it, and record it in a database. Events, terminology, emotional evaluations, named entities (e.g., names, organizations, locations), and other data can all be extracted from text.
Examples of information extraction tasks
1. When a new brand or product is released, the company needs to collect feedback and understand what consumers think about it and how its position can be improved. Information extraction makes this possible.
2. Recruiters constantly analyze an immense number of CVs and resumes. The workload can be reduced by using information extraction to weed out some of the documents: we extract the required entities from each resume and leave for the recruiter's review only those resumes that meet some initial requirements.
The process of analysis
Natural-language text is analyzed at all linguistic levels.
Each level has its own difficulties, which are overcome by particular methods.
Using specific algorithms, we turn an unstructured text into a marked-up one in which all required objects and facts are tagged and categorized.
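To make "marked up and categorized" concrete, here is a minimal sketch of how such a result might be represented. The `Annotation` class, the `render_markup` helper, the label names, and the sample sentence are all illustrative assumptions, not a description of any particular system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    start: int   # offset of the first character of the span
    end: int     # offset one past the last character of the span
    label: str   # category of the span (label names here are illustrative)


def render_markup(text: str, annotations: list[Annotation]) -> str:
    """Return the text with every annotated span wrapped as [span](LABEL)."""
    parts = []
    cursor = 0
    for ann in sorted(annotations, key=lambda a: a.start):
        if ann.start < cursor:
            raise ValueError("overlapping annotations are not supported")
        parts.append(text[cursor:ann.start])                      # untouched text
        parts.append(f"[{text[ann.start:ann.end]}]({ann.label})")  # tagged span
        cursor = ann.end
    parts.append(text[cursor:])
    return "".join(parts)


text = "Anna Smith joined Acme in Berlin."
annotations = [
    Annotation(0, 10, "PERSON"),
    Annotation(18, 22, "ORGANIZATION"),
    Annotation(26, 32, "LOCATION"),
]
marked = render_markup(text, annotations)
print(marked)
# [Anna Smith](PERSON) joined [Acme](ORGANIZATION) in [Berlin](LOCATION).
```

Storing character offsets rather than editing the text in place keeps the original document intact, so the same annotations can later be written to a database as structured records.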
The central point of analysis is the extraction of facts and entities.
Approaches to extraction
There are three main approaches to the extraction of facts and entities:
- based on ontologies;
- based on rules;
- based on machine learning.
In this context, ontologies are conceptual dictionaries that describe objects and notions together with their characteristics and relations. Depending on the task, universal, industry-specific, or highly specialized ontologies are used. Ontologies of objects, that is, databases, are also widely used; Wikipedia is a vivid example.
Ontology-based information extraction recognizes named entities with high accuracy and avoids spurious matches. Its main disadvantage is low completeness (recall): only information already present in the ontology can be extracted, so new objects have to be added to the ontology manually or through an automated procedure that you build yourself.
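The trade-off can be illustrated with a toy dictionary lookup. This is only a sketch under stated assumptions: the three-entry `ONTOLOGY`, the company name "Acme Corp", the type labels, and the function name are hypothetical, and a real ontology would also carry relations and characteristics.

```python
import re

# Hypothetical miniature ontology: lowercase surface form -> (canonical name, type).
ONTOLOGY = {
    "acme corp": ("Acme Corp", "ORGANIZATION"),
    "new york": ("New York", "LOCATION"),
    "berlin": ("Berlin", "LOCATION"),
}


def extract_with_ontology(text, ontology):
    """Return (canonical name, type, start offset) for every ontology entry found in text."""
    found = []
    lowered = text.lower()
    for surface, (name, kind) in ontology.items():
        # Whole-word match only, so "berlin" does not fire inside a longer word.
        pattern = r"\b" + re.escape(surface) + r"\b"
        for match in re.finditer(pattern, lowered):
            found.append((name, kind, match.start()))
    return sorted(found, key=lambda item: item[2])


text = "Acme Corp opened an office in New York and another in Lisbon."
entities = extract_with_ontology(text, ONTOLOGY)
print(entities)
# [('Acme Corp', 'ORGANIZATION', 0), ('New York', 'LOCATION', 30)]
```

Everything the lookup returns is a known entry, but "Lisbon" is missed because it is not in the dictionary yet.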
The rule-based approach means writing templates by hand: analysts describe the information to be extracted. Its advantage is that when mistakes show up in the analysis results, it is easy to trace the problem and correct the rules. The rule-based approach is mostly used to extract standardized objects such as names, dates, and companies.
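As a small illustration of handwritten templates, the sketch below extracts dates with two regular-expression rules. The two date formats, the rule names, and the sample sentence are assumptions chosen for the example; a production rule set would cover many more patterns.

```python
import re
from datetime import date

MONTHS = {
    name: number
    for number, name in enumerate(
        ["january", "february", "march", "april", "may", "june", "july",
         "august", "september", "october", "november", "december"],
        start=1,
    )
}

# Rule 1: ISO dates such as 2024-03-14.
ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")
# Rule 2: day, month name, year, such as 15 March 2021.
LONG_DATE = re.compile(
    r"\b(\d{1,2}) (" + "|".join(MONTHS) + r") (\d{4})\b", re.IGNORECASE
)


def extract_dates(text):
    """Apply each rule to the text and return the dates in order of appearance."""
    hits = []
    for m in ISO_DATE.finditer(text):
        hits.append((m.start(), int(m.group(1)), int(m.group(2)), int(m.group(3))))
    for m in LONG_DATE.finditer(text):
        hits.append((m.start(), int(m.group(3)), MONTHS[m.group(2).lower()], int(m.group(1))))
    dates = []
    for _, year, month, day in sorted(hits):
        try:
            dates.append(date(year, month, day))
        except ValueError:
            pass  # matched the pattern but is not a real calendar date
    return dates


text = "The contract was signed on 15 March 2021 and expires on 2024-03-14."
dates = extract_dates(text)
print(dates)
# [datetime.date(2021, 3, 15), datetime.date(2024, 3, 14)]
```

Because each rule is a named, readable pattern, a wrong or missing result can be traced to the specific rule responsible and fixed there.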
The machine learning approach requires a large amount of input data. The training texts have to be prepared in detail, with morphology, syntax, semantics, and ontological connections marked up.
This approach has the following advantages:
- No manual work is required apart from creating the data corpus.
- Such a system can be easily re-trained and re-adjusted.
- The rules, in this case, are more abstract.
The drawbacks are:
- For many languages, the set of automatic tagging tools is very limited.
- The data corpus has to be prepared very thoroughly, which is time-consuming.
- If an error occurs, it is difficult to localize and correct without changing the system as a whole.
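The idea of learning from an annotated corpus can be sketched with a deliberately crude count-based tagger. It stands in for a real statistical model and is not how any production system is built; the three-sentence corpus, the tag names (`PER`, `ORG`, `O`), the feature set, and all function names are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def features(tokens, i):
    """Describe token i by its word form, the previous word, and its capitalization."""
    prev = tokens[i - 1].lower() if i > 0 else "<s>"
    return [
        f"word={tokens[i].lower()}",
        f"prev={prev}",
        f"cap={tokens[i][0].isupper()}",
    ]


def train(corpus):
    """Count how often each feature co-occurs with each tag in the annotated corpus."""
    counts = defaultdict(Counter)
    for tokens, tags in corpus:
        for i, tag in enumerate(tags):
            for feat in features(tokens, i):
                counts[feat][tag] += 1
    return counts


def predict(counts, tokens):
    """Tag each token with the tag that its known features vote for most strongly."""
    tags = []
    for i in range(len(tokens)):
        scores = Counter()
        for feat in features(tokens, i):
            seen = counts.get(feat)
            if not seen:
                continue  # feature never observed in training
            total = sum(seen.values())
            for tag, n in seen.items():
                scores[tag] += n / total  # relative frequency of tag given feature
        tags.append(max(scores, key=scores.get) if scores else "O")
    return tags


# Hypothetical hand-annotated training corpus: (tokens, tags).
CORPUS = [
    ("Anna works at Acme".split(), ["PER", "O", "O", "ORG"]),
    ("Boris works at Globex".split(), ["PER", "O", "O", "ORG"]),
    ("Clara joined Initech yesterday".split(), ["PER", "O", "ORG", "O"]),
]

model = train(CORPUS)
predicted = predict(model, "Diana works at Hooli".split())
print(predicted)
# ['PER', 'O', 'O', 'ORG']
```

Neither "Diana" nor "Hooli" appears in the training data; the tagger generalizes from context and capitalization, and it can be re-trained simply by adding annotated sentences. The flip side matches the drawbacks above: a wrong prediction comes from the combined counts rather than from one identifiable rule, so fixing it usually means changing the corpus or the features.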