Author archive: Marcin

About Marcin

Designer, programmer, enthusiast

Application for text processing based on our REST API which integrates four language models

We are building our own text-processing application. It is based on our REST API, which integrates four language models. The API supports learning and testing: it is simple, compact, and ready to use. It spares you the time-consuming configuration of many language models, so you can concentrate on developing your own solutions and applications.

The OPI Toolkit for NLP is:

  • multilingual, i.e. it enables analysis of documents written in different languages: Polish, English, German, and French,
  • ready for learning and testing, because it is simple: you can quickly prototype and develop your own solution based on our API,
  • compact: you can spend your time solving real problems rather than configuring and implementing basic NLP functionality, which is available immediately.

Check it out on the Inventorum NLP Tools website.

The BigGrams: the semi-supervised information extraction system from HTML: an improvement in the wrapper induction

The aim of this study is to propose an information extraction system, called BigGrams, which is able to retrieve relevant and structured information (relevant phrases, keywords) from semi-structured web pages, i.e. HTML documents. For this purpose, a novel semi-supervised wrapper induction algorithm has been developed and embedded in the BigGrams system. The wrapper induction algorithm uses formal concept analysis to induce information extraction patterns. In this article, the author also (1) presents the impact of the configuration of the system's components on information extraction results and (2) tests the boosting mode of the system. Based on empirical research, the author established that the proposed taxonomy of seeds and the HTML tag-level analysis, with appropriate pre-processing, improve information extraction results. The boosting mode also works well when certain requirements are met, i.e. when well-diversified input data are ensured.
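To make the idea of seed-driven wrapper induction more concrete, here is a minimal Python sketch. It is not the BigGrams algorithm: it leaves out formal concept analysis, the taxonomy of seeds, and the boosting mode. It only learns which pair of HTML tags surrounds known seed values and then extracts every other text enclosed by the same pair. The function names, the sample page, and the min_support threshold are illustrative assumptions.

```python
import re
from collections import Counter

TOKEN = re.compile(r"<[^>]+>|[^<]+")  # either one HTML tag or one run of text


def tokenize(html):
    """Split HTML into tag tokens and text tokens, dropping whitespace-only text."""
    return [t.strip() for t in TOKEN.findall(html) if t.strip()]


def induce_wrappers(pages, seeds):
    """Count the (left tag, right tag) pairs that surround known seed values."""
    wrappers = Counter()
    for page in pages:
        tokens = tokenize(page)
        for i in range(1, len(tokens) - 1):
            if tokens[i] in seeds:
                wrappers[(tokens[i - 1], tokens[i + 1])] += 1
    return wrappers


def extract(pages, wrappers, min_support=2):
    """Return every text token enclosed by a wrapper seen at least min_support times."""
    trusted = {pair for pair, n in wrappers.items() if n >= min_support}
    found = set()
    for page in pages:
        tokens = tokenize(page)
        for i in range(1, len(tokens) - 1):
            is_text = not tokens[i].startswith("<")
            if is_text and (tokens[i - 1], tokens[i + 1]) in trusted:
                found.add(tokens[i])
    return found


# Hypothetical input: one page and two seed values known in advance.
pages = ["<h1>Cities</h1><ul><li>Warsaw</li><li>Berlin</li>"
         "<li>Paris</li><li>London</li></ul>"]
seeds = {"Warsaw", "Paris"}

wrappers = induce_wrappers(pages, seeds)
extracted = extract(pages, wrappers)
print(sorted(extracted))
```

In this sketch the two seeds are enough to induce the <li>…</li> wrapper, which then also yields the values that were not given as seeds, while the heading text is left out because no seed occurs inside an <h1> element.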

Article – The BigGrams: the semi-supervised information extraction system from HTML: an improvement in the wrapper induction

The article The BigGrams: the semi-supervised information extraction system from HTML: an improvement in the wrapper induction has been published in Knowledge and Information Systems. Its abstract is given above.

The article is available in the Publications section. As always, enjoy the read.

Categorization of Multilingual Scientific Documents by a Compound Classification System

The aim of this study was to propose a classification method for documents that simultaneously contain text parts in various languages. For this purpose, we constructed a three-level classification system. On the first level, a data processing module prepares a suitable vector space model. Next, in the middle tier, a set of monolingual or multilingual classifiers assigns to each document, or to its parts, the probabilities of belonging to each possible category. The models are trained using the Multinomial Naive Bayes and Long Short-Term Memory algorithms. Finally, in the last component, a multilingual decision module assigns a target class to each document. The module is built on a logistic regression classifier, which receives as its inputs the probabilities produced by the classifiers. The system has been verified experimentally. The reported results suggest that the proposed system can deal with textual documents whose content is composed of many languages at the same time. Therefore, the system can be useful in the automatic organization of multilingual publications or other documents.
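The three levels described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the system from the paper: it uses only Multinomial Naive Bayes in the middle tier (no LSTM), two hypothetical categories, two languages, a toy training set, and it trains the decision module on the same documents as the middle tier. All names (MultinomialNB, stack, train_decision_module, classify) are made up for the example.

```python
import math
from collections import Counter

CATEGORIES = ["biology", "physics"]  # hypothetical categories


class MultinomialNB:
    """Level 2: monolingual multinomial Naive Bayes with Laplace smoothing."""

    def fit(self, texts, labels):
        self.priors = Counter(labels)
        self.counts = {c: Counter() for c in CATEGORIES}
        for text, label in zip(texts, labels):
            # Level 1: a bag-of-words vector space built from whitespace tokens.
            self.counts[label].update(text.lower().split())
        self.vocab = set().union(*self.counts.values())
        return self

    def predict_proba(self, text):
        """Return [P(category | text) for category in CATEGORIES]."""
        words = [w for w in text.lower().split() if w in self.vocab]
        logs = []
        for c in CATEGORIES:
            total = sum(self.counts[c].values()) + len(self.vocab)
            score = math.log(self.priors[c] / sum(self.priors.values()))
            score += sum(math.log((self.counts[c][w] + 1) / total) for w in words)
            logs.append(score)
        exps = [math.exp(s - max(logs)) for s in logs]
        return [e / sum(exps) for e in exps]


def stack(models, doc):
    """Concatenate the class probabilities of every language part of a document."""
    features = []
    for lang in sorted(models):
        # A missing language part is scored as empty text, i.e. by the priors only.
        features += models[lang].predict_proba(doc.get(lang, ""))
    return features


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_decision_module(X, y, lr=0.5, epochs=300):
    """Level 3: logistic regression fitted by batch gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        errors = [sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - t
                  for x, t in zip(X, y)]
        w = [wi - lr * sum(e * x[j] for e, x in zip(errors, X)) / len(X)
             for j, wi in enumerate(w)]
        b -= lr * sum(errors) / len(X)
    return w, b


def classify(models, decision, doc):
    """Assign the target class from the stacked per-language probabilities."""
    w, b = decision
    z = sum(wi * xi for wi, xi in zip(w, stack(models, doc))) + b
    return "physics" if z >= 0 else "biology"


# Toy training set: every document has an English and a Polish part.
train = [
    ({"en": "quantum particle energy", "pl": "kwant cząstka energia"}, "physics"),
    ({"en": "particle energy field", "pl": "cząstka energia pole"}, "physics"),
    ({"en": "cell protein gene", "pl": "komórka białko gen"}, "biology"),
    ({"en": "protein gene tissue", "pl": "białko gen tkanka"}, "biology"),
]
models = {lang: MultinomialNB().fit([d[lang] for d, _ in train],
                                    [c for _, c in train])
          for lang in ("en", "pl")}
X = [stack(models, d) for d, _ in train]
y = [1.0 if c == "physics" else 0.0 for _, c in train]
decision = train_decision_module(X, y)

print(classify(models, decision, {"en": "energy of a quantum field",
                                  "pl": "energia pola kwantowego"}))
print(classify(models, decision, {"en": "gene and protein"}))
```

The point of the design is that the decision module never sees words, only the per-language probabilities, so a document with a missing or uninformative part in one language can still be classified from the others. In a real setting the decision module would be trained on held-out predictions rather than on the documents the middle tier was fitted on.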

Article – Categorization of Multilingual Scientific Documents by a Compound Classification System

The 16th International Conference on Artificial Intelligence and Soft Computing, ICAISC 2017 (Zakopane, Poland, June 11-15, 2017), starts soon. A paper on the classification of multilingual documents was submitted to the conference and accepted. Its abstract is given above.

The article is available in the Publications section. As always, enjoy the read.

Article – Detection of the Innovative Logotypes on the Web Pages

The aim of this study was to describe a method for detecting logotypes that indicate the innovativeness of companies, where the images originate from the companies' Internet domains. For this purpose, we developed a system that combines a supervised and a heuristic approach to construct a reference dataset for each logotype category; the logistic regression classifiers use this dataset to recognize a logotype category. The proposed approach uses a one-versus-the-rest learning strategy to train the logistic regression classification models that recognize the classes of innovative logotypes. Thanks to this, we can detect whether a given company's Internet domain contains an innovative logotype or not. Moreover, we found a way to construct a simple, low-dimensional feature space for the image recognition process. The proposed feature space of the logotype classification models is based on image similarity weights and on the textual data of the images obtained from HTML ALT attributes.
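The one-versus-the-rest strategy over a small feature space can be illustrated with the following Python sketch. It is an assumption-laden toy, not the published system: the three logotype categories, the keyword per category, and the similarity scores are all hypothetical, and the image similarity to each category's reference set is given by hand instead of being computed from images.

```python
import math

CATEGORIES = ["award", "certificate", "eu_funding"]  # hypothetical categories
KEYWORDS = {"award": "award", "certificate": "certificate",
            "eu_funding": "european"}  # hypothetical ALT-text keywords


def features(similarity, alt_text):
    """One similarity weight and one ALT-keyword flag per category."""
    alt = alt_text.lower()
    return ([similarity[c] for c in CATEGORIES]
            + [1.0 if KEYWORDS[c] in alt else 0.0 for c in CATEGORIES])


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_binary(X, y, lr=0.5, epochs=500):
    """Fit one logistic regression model with batch gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        errors = [sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - t
                  for x, t in zip(X, y)]
        w = [wi - lr * sum(e * x[j] for e, x in zip(errors, X)) / len(X)
             for j, wi in enumerate(w)]
        b -= lr * sum(errors) / len(X)
    return w, b


def train_one_vs_rest(X, labels):
    """Train one binary model per category: that category versus all the others."""
    return {c: train_binary(X, [1.0 if label == c else 0.0 for label in labels])
            for c in CATEGORIES}


def predict(models, x):
    """Pick the category whose binary model reports the highest probability."""
    scores = {c: sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
              for c, (w, b) in models.items()}
    return max(scores, key=scores.get)


# Hypothetical reference data: (similarity to each reference set, ALT text, label).
samples = [
    ({"award": 0.9, "certificate": 0.1, "eu_funding": 0.1}, "Innovation award 2016", "award"),
    ({"award": 0.8, "certificate": 0.2, "eu_funding": 0.2}, "", "award"),
    ({"award": 0.1, "certificate": 0.9, "eu_funding": 0.1}, "ISO certificate", "certificate"),
    ({"award": 0.2, "certificate": 0.8, "eu_funding": 0.2}, "", "certificate"),
    ({"award": 0.1, "certificate": 0.1, "eu_funding": 0.9}, "European Funds", "eu_funding"),
    ({"award": 0.2, "certificate": 0.2, "eu_funding": 0.8}, "", "eu_funding"),
]
X = [features(sim, alt) for sim, alt, _ in samples]
labels = [label for _, _, label in samples]
models = train_one_vs_rest(X, labels)

query = features({"award": 0.2, "certificate": 0.1, "eu_funding": 0.7},
                 "Co-financed by the European Union")
print(predict(models, query))
```

With three categories the feature vector has only six dimensions, which mirrors the idea of keeping the feature space small: the classifiers work on similarity weights and ALT-text evidence rather than on raw pixels.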

Article – Detection of the Innovative Logotypes on the Web Pages

The 16th International Conference on Artificial Intelligence and Soft Computing, ICAISC 2017 (Zakopane, Poland, June 11-15, 2017), starts soon. A paper on the classification of logotypes was submitted to the conference and accepted. Its abstract is given above.

The article is available in the Publications section. As always, enjoy the read.

Article – A Diversified Classification Committee for Recognition of Innovative Internet Domains

The objective of this paper was to propose a method for classifying innovative domains on the Internet. The proposed approach helped to estimate whether companies are innovative or not by analyzing their web pages. A Naïve Bayes classification committee was used as the classification system for the domains. The classifiers in the committee were based concurrently on Bernoulli and Multinomial feature distribution models, which were selected depending on the diversity of the input data. Moreover, information retrieval procedures were applied to find the documents in each domain that most likely indicate innovativeness. The proposed methods have been verified experimentally. The results have shown that the diversified classification committee, combined with the information retrieval approach in the preprocessing phase, boosts the classification quality for domains that may represent innovative companies. This approach may be applied to other classification tasks.
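A committee that combines the two feature distribution models can be sketched in a few lines of Python. This is a minimal illustration, not the committee from the paper: it always uses both members and averages their posteriors (soft voting), whereas the paper selects the models depending on the diversity of the input data. The class and function names and the four training snippets are assumptions.

```python
import math
from collections import Counter


def tokens(text):
    return text.lower().split()


class NaiveBayes:
    """Naive Bayes with Laplace smoothing; model is 'bernoulli' or 'multinomial'."""

    def __init__(self, model):
        self.model = model

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.vocab = sorted({w for d in docs for w in tokens(d)})
        self.n_docs = Counter(labels)
        self.counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            words = tokens(doc)
            # Bernoulli counts documents containing a word,
            # Multinomial counts every occurrence of the word.
            self.counts[label].update(set(words) if self.model == "bernoulli" else words)
        return self

    def predict_proba(self, doc):
        """Return a dict mapping each class to its posterior probability."""
        present = {w for w in tokens(doc) if w in self.vocab}
        known = [w for w in tokens(doc) if w in self.vocab]
        log_scores = {}
        for c in self.classes:
            score = math.log(self.n_docs[c] / sum(self.n_docs.values()))
            if self.model == "bernoulli":
                for w in self.vocab:  # absent words count as evidence too
                    p = (self.counts[c][w] + 1) / (self.n_docs[c] + 2)
                    score += math.log(p if w in present else 1 - p)
            else:
                total = sum(self.counts[c].values()) + len(self.vocab)
                for w in known:
                    score += math.log((self.counts[c][w] + 1) / total)
            log_scores[c] = score
        top = max(log_scores.values())
        exps = {c: math.exp(s - top) for c, s in log_scores.items()}
        norm = sum(exps.values())
        return {c: v / norm for c, v in exps.items()}


def committee_predict(members, doc):
    """Soft voting: average the members' posteriors and return the best class."""
    average = Counter()
    for member in members:
        for c, p in member.predict_proba(doc).items():
            average[c] += p / len(members)
    return max(average, key=average.get)


# Hypothetical snippets of text taken from company web pages.
docs = ["patent research prototype", "research patent laboratory",
        "discount sale shipping", "sale discount delivery"]
labels = ["innovative", "innovative", "other", "other"]
members = [NaiveBayes("bernoulli").fit(docs, labels),
           NaiveBayes("multinomial").fit(docs, labels)]

print(committee_predict(members, "new patent from our research laboratory"))
print(committee_predict(members, "summer sale with free shipping"))
```

The two members look at the same text differently: the Bernoulli model also treats the absence of a vocabulary word as evidence, while the Multinomial model only scores the words that occur, which is one reason a committee of both can be more robust than either alone.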

Article – A Diversified Classification Committee for Recognition of Innovative Internet Domains

The next edition of the conference, the 12th International Conference BDAS 2016 (Ustroń, Poland, May 31 – June 3, 2016), starts soon. A paper on the classification of Internet domains was submitted to the conference. Its abstract is given above.

The article is available in the Publications section. As always, enjoy the read.