The aim of this study was to propose a classification method for documents that simultaneously contain text in several languages. For this purpose, we constructed a three-level classification system. On the first level, a data processing module prepares a suitable vector space model. Next, in the middle tier, a set of monolingual or multilingual classifiers assigns to each document, or to each of its parts, the probability of belonging to each possible category. These models are trained with the Multinomial Naive Bayes and Long Short-Term Memory algorithms. Finally, in the last component, a multilingual decision module assigns a target class to each document. The module is built on a logistic regression classifier that receives as input the probabilities produced by the middle-tier classifiers. The system has been verified experimentally. The reported results suggest that the proposed system can handle textual documents whose content is written in several languages at once. The system can therefore be useful for automatically organizing multilingual publications or other documents.
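The three tiers described above can be illustrated with a minimal sketch. It is not the implementation from the article: it uses only the Python standard library, replaces the LSTM models with a second Multinomial Naive Bayes model, and trains on a tiny invented English/Polish data set. All names (train_mnb, predict_proba, meta_features, train_logreg, predict_logreg) and the two categories are hypothetical.

```python
import math
from collections import Counter

def train_mnb(docs, labels, alpha=1.0):
    """Middle tier: train one multinomial Naive Bayes model (Laplace smoothing)."""
    classes = sorted(set(labels))
    vocab = {w for d in docs for w in d.split()}
    counts = {c: Counter() for c in classes}
    for d, y in zip(docs, labels):
        counts[y].update(d.split())
    priors = {c: labels.count(c) / len(labels) for c in classes}
    return classes, vocab, counts, priors, alpha

def predict_proba(model, text):
    """Return normalized class probabilities for one text part."""
    classes, vocab, counts, priors, alpha = model
    log_scores = []
    for c in classes:
        total = sum(counts[c].values()) + alpha * len(vocab)
        s = math.log(priors[c])
        for w in text.split():
            if w in vocab:  # words unseen in training are ignored
                s += math.log((counts[c][w] + alpha) / total)
        log_scores.append(s)
    m = max(log_scores)
    exps = [math.exp(s - m) for s in log_scores]
    z = sum(exps)
    return [e / z for e in exps]

def meta_features(models, doc_parts):
    """Concatenate the per-language probabilities into one input vector."""
    feats = []
    for lang in sorted(models):
        feats.extend(predict_proba(models[lang], doc_parts.get(lang, "")))
    return feats

def train_logreg(X, y, lr=0.5, epochs=500):
    """Last tier: binary logistic regression trained by stochastic gradient descent."""
    w = [0.0] * (len(X[0]) + 1)  # the last entry is the bias
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = w[-1] + sum(wj * xj for wj, xj in zip(w, xi))
            g = 1.0 / (1.0 + math.exp(-z)) - yi
            for j, xj in enumerate(xi):
                w[j] -= lr * g * xj
            w[-1] -= lr * g
    return w

def predict_logreg(w, x):
    z = w[-1] + sum(wj * xj for wj, xj in zip(w, x))
    return 1 if z > 0 else 0

# Invented toy data: label "sport" is encoded as 0 and "tech" as 1 in the last tier.
train = {
    "en": (["football match goal", "team wins match",
            "new processor chip", "software update chip"],
           ["sport", "sport", "tech", "tech"]),
    "pl": (["mecz gol druzyna", "druzyna wygrywa mecz",
            "nowy procesor uklad", "aktualizacja oprogramowania procesor"],
           ["sport", "sport", "tech", "tech"]),
}
models = {lang: train_mnb(docs, labels) for lang, (docs, labels) in train.items()}

mixed_docs = [
    ({"en": "football goal", "pl": "mecz druzyna"}, 0),
    ({"en": "team match", "pl": "gol mecz"}, 0),
    ({"en": "processor software", "pl": "procesor uklad"}, 1),
    ({"en": "chip update", "pl": "aktualizacja oprogramowania"}, 1),
]
X = [meta_features(models, parts) for parts, _ in mixed_docs]
y = [label for _, label in mixed_docs]
weights = train_logreg(X, y)

test_doc = {"en": "new chip", "pl": "nowy procesor"}
predicted = predict_logreg(weights, meta_features(models, test_doc))
print(predicted)
```

The decision module never sees words, only the probability vectors produced by the language-specific classifiers, which is what lets one final classifier combine evidence from parts written in different languages.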
Article – Detection of the Innovative Logotypes on the Web Pages
The aim of this study was to describe a method for detecting logotypes that indicate the innovativeness of companies, where the images originate from the companies' Internet domains. For this purpose, we developed a system that combines a supervised and a heuristic approach to construct a reference dataset for each logotype category. The logistic regression classifiers use these datasets to recognize a logotype category. The proposed approach uses the one-versus-the-rest learning strategy to train the logistic regression models that recognize the classes of innovative logotypes. As a result, we can detect whether a given company's Internet domain contains an innovative logotype or not. Moreover, we found a way to construct a simple, low-dimensional feature space for the image recognition process. The proposed feature space of the logotype classification models is based on image similarity weights and on the textual data of the images taken from HTML ALT attributes.
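The one-versus-the-rest strategy mentioned above can be sketched as follows. This is only an illustration using the Python standard library, not the system from the article. The three features (similarity to two reference sets and an ALT-text keyword flag), the category names and all numeric values are invented for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(model, x):
    """Probability that x belongs to the positive class of one binary model."""
    w, b = model
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))

def train_binary(X, y, lr=0.5, epochs=1000):
    """Binary logistic regression trained by stochastic gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = score((w, b), xi) - yi
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def train_ovr(X, labels):
    """One-versus-the-rest: one binary model per category."""
    return {c: train_binary(X, [1 if l == c else 0 for l in labels])
            for c in sorted(set(labels))}

def predict_ovr(models, x):
    """Pick the category whose binary model gives the highest score."""
    scores = {c: score(m, x) for c, m in models.items()}
    return max(scores, key=scores.get)

# Hypothetical features per image:
# [similarity to reference set A, similarity to reference set B, ALT keyword flag]
X = [
    [0.9, 0.1, 1], [0.8, 0.2, 1],      # hypothetical category "eu_programme"
    [0.1, 0.9, 1], [0.2, 0.8, 0],      # hypothetical category "quality_award"
    [0.1, 0.1, 0], [0.2, 0.15, 0],     # no innovative logotype
]
labels = ["eu_programme", "eu_programme",
          "quality_award", "quality_award",
          "not_innovative", "not_innovative"]

models = train_ovr(X, labels)
print(predict_ovr(models, [0.85, 0.15, 1]))
print(predict_ovr(models, [0.15, 0.1, 0]))
```

With a feature space this small, each binary model has only three weights and a bias, which is the practical benefit of the compact representation the abstract describes.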
Article – A Diversified Classification Committee for Recognition of Innovative Internet Domains
The objective of this paper was to propose a method for classifying innovative domains on the Internet. The proposed approach helped to estimate whether companies are innovative or not by analyzing their web pages. A Naïve Bayes classification committee was used as the classification system for the domains. The classifiers in the committee were based concurrently on the Bernoulli and Multinomial feature distribution models, which were selected depending on the diversity of the input data. Moreover, information retrieval procedures were applied to find the documents in each domain that most likely indicate innovativeness. The proposed methods have been verified experimentally. The results show that the diversified classification committee, combined with the information retrieval approach in the preprocessing phase, improves the classification quality for domains that may represent innovative companies. This approach may be applied to other classification tasks.
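The difference between the two feature distribution models can be shown with a short sketch. It uses only the Python standard library and invented toy documents. The combination rule is an assumption made for illustration: the committee here simply averages the two posteriors, whereas the article selects models depending on the diversity of the input data. All function names are hypothetical.

```python
import math
from collections import Counter

def fit(docs, labels, alpha=1.0):
    """Collect the statistics needed by both Naive Bayes variants."""
    classes = sorted(set(labels))
    vocab = sorted({w for d in docs for w in d.split()})
    tf = {c: Counter() for c in classes}  # term counts (Multinomial model)
    df = {c: Counter() for c in classes}  # document counts (Bernoulli model)
    for d, y in zip(docs, labels):
        tokens = d.split()
        tf[y].update(tokens)
        df[y].update(set(tokens))
    return {"classes": classes, "vocab": vocab, "tf": tf, "df": df,
            "n": Counter(labels), "alpha": alpha}

def _normalize(log_scores):
    m = max(log_scores.values())
    exps = {c: math.exp(s - m) for c, s in log_scores.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}

def multinomial_posterior(m, text):
    """Uses how often each known word occurs in the text."""
    tokens = [w for w in text.split() if w in m["vocab"]]
    total_docs = sum(m["n"].values())
    scores = {}
    for c in m["classes"]:
        denom = sum(m["tf"][c].values()) + m["alpha"] * len(m["vocab"])
        s = math.log(m["n"][c] / total_docs)
        s += sum(math.log((m["tf"][c][w] + m["alpha"]) / denom) for w in tokens)
        scores[c] = s
    return _normalize(scores)

def bernoulli_posterior(m, text):
    """Uses only the presence or absence of every vocabulary word."""
    present = set(text.split())
    total_docs = sum(m["n"].values())
    scores = {}
    for c in m["classes"]:
        s = math.log(m["n"][c] / total_docs)
        for w in m["vocab"]:
            p = (m["df"][c][w] + m["alpha"]) / (m["n"][c] + 2 * m["alpha"])
            s += math.log(p if w in present else 1.0 - p)
        scores[c] = s
    return _normalize(scores)

def committee_predict(m, text):
    """Assumed combination rule: average the two posteriors."""
    pm, pb = multinomial_posterior(m, text), bernoulli_posterior(m, text)
    avg = {c: (pm[c] + pb[c]) / 2 for c in m["classes"]}
    return max(avg, key=avg.get), avg

docs = ["research patent new technology", "innovative research laboratory",
        "cheap price shop discount", "shop delivery price"]
labels = ["innovative", "innovative", "other", "other"]
model = fit(docs, labels)
label, probs = committee_predict(model, "patent research laboratory")
print(label)
```

The Bernoulli variant also penalizes a class for vocabulary words that are absent from the page, while the Multinomial variant looks only at the words that occur, so the two members can disagree on short or sparse pages.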
Article – The hybrid decision support system for Fire Service – chosen project’s problems
This article describes the process of designing a hybrid decision support system (HSWD) for the Fire Service. The design process follows the design for trustworthy software (DFTS) methodology. The article describes selected project problems and their solutions in the first stage of the proposed design process.
Article – Review of methods and text data mining techniques
This article presents the author’s classification of the methods and techniques of textual data mining. It also describes the currently available methods and means of representing textual data and the techniques for processing them. It then discusses how text documents can be processed with the presented methods, along with the possibilities and limitations of each method.
Article – Crowdsourcing in rescue fire service – proposed application
A few days ago the SIMIS journal published my article "Crowdsourcing in rescue fire service – proposed application". In it I describe a proposal to apply crowdsourcing in the Polish fire and rescue service, the basic principles for implementing a crowdsourcing information platform there, and a scheme for its implementation. I also describe the origin of this proposal, which relates to the evaluation of my research on text mining analysis and information extraction in the design of information systems.
Resolved problem with revoIPC 1.0-3 and new g++
A few days ago I had a problem installing the packages revoIPC 1.0-3 and doSMP 1.0.1. When I compiled revoIPC 1.0-3, the compiler returned this error:
g++ -I/usr/local/lib64/R/include -I/usr/local/include -I . -fpic
-g -O2 -c interface.cc -o interface.o
In file included from ./boost/interprocess/detail/
from ./boost/interprocess/mapped_
from ./boost/interprocess/detail/
from boost/interprocess/managed_
from queue.h:17,
from interface.cc:3:
./boost/interprocess/detail/
'm_value' cannot be declared 'mutable' [-fpermissive]
make: *** [interface.o] Błąd 1
ERROR: compilation failed for package 'revoIPC'
Rich Calaway from revolutionanalytics.com quickly helped me and sent a solution. The fix for this problem is:
"… go to src/boost/interprocess/detail/iterators.hpp and comment out lines 341-353"
Thanks to this advice, we can compile revoIPC 1.0-3 and use it with doSMP 1.0.1. One more thing: if we use doSMP 1.0.1 with the new revoIPC 1.0-4 and try to run this program:
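library(doSMP)                  # load the doSMP backend (foreach is attached as its dependency)
rmSessions(all.names = TRUE)    # clean up any leftover worker sessions
w <- startWorkers(2)            # start two worker processes
registerDoSMP(w)                # register them as the %dopar% backend
foreach(i=1:3) %dopar% sqrt(i)  # run three tasks in parallel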
R returns this error:
> foreach(i=1:3) %dopar% sqrt(i)
*** caught segfault ***
address 0x7fd0e562f58c, cause 'memory not mapped'
Traceback:
1: .Call("returnResult", q, t$task, serialize(res, NULL))
2: ipcTaskReturnResult(taskq, taskchunk, resultchunk)
3: doSMP:::workerLoop(qname, rank, verbose, out)
aborting …
Rich Calaway helped me fix this problem too. I have not verified the fix myself, but anyone who wants to use revoIPC 1.0-4 must:
"… remove the PKG_CPPFLAGS=-DNDEBUG line from Makevars.in in the src directory."
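If you prefer to script this edit rather than make it by hand, the following is a minimal Python sketch. The helper name drop_ndebug_flag and the sample file content (including the PKG_LIBS line) are hypothetical; only the PKG_CPPFLAGS=-DNDEBUG line comes from the advice above.

```python
import re

def drop_ndebug_flag(makevars_text):
    """Return the text without lines that set PKG_CPPFLAGS to -DNDEBUG."""
    pattern = re.compile(r"^\s*PKG_CPPFLAGS\s*=\s*-DNDEBUG\s*$")
    kept = [line for line in makevars_text.splitlines(keepends=True)
            if not pattern.match(line)]
    return "".join(kept)

# Hypothetical file content used only to demonstrate the helper.
sample = "PKG_CPPFLAGS=-DNDEBUG\nPKG_LIBS=-lrt\n"
cleaned = drop_ndebug_flag(sample)
print(cleaned)
```

To apply it to the real package, read src/Makevars.in into a string, pass it through the helper and write the result back before rebuilding.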
Good luck.
BSD license for chosen projects
Copyright (c) 2009-2010, proFind Marcin Mirończuk PL.
All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
* Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
* Neither the name of proFind Marcin Mirończuk nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
New data mining review
Over the past few months I wrote a data mining review, "Data mining review and use’s classification, methods and techniques". The article will appear in Studia i Materiały Informatyki Stosowanej (SIMIS, http://www.simis.ukw.edu.pl/) if it is reviewed favourably. Below is the abstract of the paper:
The large quantity of data and information accumulated in current information systems, and its continual growth, has forced the development of new processes, techniques and methods for storing, processing and analysing it. Achievements from statistical analysis and artificial intelligence are now used to analyse large data sets. These fields make up the core of data exploration, or data mining. Data mining currently aspires to be an independent scientific method for solving information analysis problems that arise from database management systems. This article reviews and classifies the applications, methods and techniques used in the data exploration process. It also describes current development directions and the elements that this young applied discipline still requires.
Questionnaire for fire service
Yesterday I finished implementing a questionnaire for the fire service. The questionnaire was built for quantitative and qualitative research. The goal of this research is to create a hybrid decision support system for the Polish fire service. It is a complicated problem that requires many different lines of research. The research integrates solutions from logistics, transport, game theory, artificial intelligence, linguistics and text information retrieval, such as text mining and especially text representation and processing. The whole project is described using the design for trustworthy software (DFTS) process. The results will be published after the respondents complete the questionnaire.
The questionnaire was implemented using web technologies: Java, JSF, Hibernate and MySQL.