Category archive: Articles

Document Classification Pattern Recognition via Information Fusion: A systematic review of multimodal and multiview representation approaches

Information fusion is widely used to improve document classification by integrating multiple data sources (multimodal) or representations (multiview). However, the field lacks a unified framework, a quantitative synthesis of its effectiveness, and clear guidance for practitioners. This systematic review addresses these gaps by analysing 139 primary studies. It introduces a formal framework to structure the field, presents a qualitative analysis that identifies key trends, and performs a random-effects meta-analysis (to our knowledge, the first focused on document classification) to quantify performance gains. Our meta-analysis reveals that multimodal fusion significantly improves accuracy (mean gain of +5.28 percentage points, p = .0016), whereas the F1-score effect is directionally positive but statistically non-significant in our primary model. Multiview fusion provides consistent but modest gains for accuracy (+4.67%), F1-score (+3.08%), and recall (all p < .05). Critically, our qualitative synthesis uncovers challenges in reproducibility and methodological rigour: only 11.8% (multimodal) and 23.3% (multiview) of the studies use statistical tests to validate their findings, which undermines the reliability of many of their results. The primary contributions of this review are a unifying framework, the first quantitative evidence base, and data-driven guidelines. The review concludes that successful information fusion depends not on algorithmic complexity, but on the strategic alignment of the fusion method with the task context and a commitment to more rigorous validation.
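As a rough illustration of what a random-effects meta-analysis computes, the sketch below pools per-study accuracy gains with the DerSimonian-Laird estimator. The abstract does not say which estimator the review uses, so this choice, the function name `random_effects_pool`, and the example numbers are assumptions made purely for illustration; they are not data from the review.

```python
import math


def random_effects_pool(effects, variances):
    """Pool per-study effects with the DerSimonian-Laird estimator.

    effects   -- per-study effect sizes (e.g. accuracy gain in points)
    variances -- per-study sampling variances of those effects
    Returns (pooled_effect, standard_error, tau_squared, two_sided_p).
    """
    k = len(effects)
    # Fixed-effect (inverse-variance) weights and pooled estimate.
    w = [1.0 / v for v in variances]
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    # Cochran's Q measures heterogeneity around the fixed-effect estimate.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sw - sum(wi ** 2 for wi in w) / sw
    # Between-study variance, truncated at zero.
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    # Random-effects weights add tau^2 to every study's variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    # Two-sided p-value under a normal approximation.
    p = math.erfc(abs(pooled / se) / math.sqrt(2))
    return pooled, se, tau2, p


if __name__ == "__main__":
    # Made-up gains and variances, for illustration only.
    gains = [4.0, 6.5, 3.2, 7.1]
    variances = [1.5, 2.0, 1.0, 2.5]
    pooled, se, tau2, p = random_effects_pool(gains, variances)
    print(f"pooled gain={pooled:.2f}, SE={se:.2f}, tau^2={tau2:.2f}, p={p:.4f}")
```

Unlike a fixed-effect model, the random-effects weights shrink towards equality as the between-study variance grows, so no single large study dominates a heterogeneous set of results.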

Readers can find more information in the English version of this post or directly in the article.

Unveiling Dual Quality in Product Reviews: An NLP-Based Approach

Consumers often face inconsistent product quality, particularly when identical products vary between markets, a situation known as the dual quality problem. To identify and address this issue, automated techniques are needed. This paper explores how natural language processing (NLP) can aid in detecting such discrepancies and presents the full process of developing a solution. First, we describe in detail the creation of a new Polish-language dataset of 1,957 reviews, 540 of which highlight dual quality issues. We then discuss experiments with various approaches, such as SetFit with sentence-transformers, transformer-based encoders, and LLMs, including error analysis and robustness verification. Additionally, we evaluate multilingual transfer using a subset of opinions in English, French, and German. The paper concludes with insights on deployment and practical applications.

Readers can find more information in the English version of this post or directly in the article.

The Outcomes and Publication Standards of Research Descriptions in Document Classification: A Systematic Review

Document classification, a critical area of research, employs machine and deep learning methods to solve real-world problems. This study highlights the qualitative and quantitative outcomes of a literature review with a broad scope, covering machine and deep learning methods as well as solutions inspired by nature, biology, or quantum physics. A rigorous synthesis was conducted using a systematic literature review of 102 papers published between 2003 and 2023. The 20 Newsgroups dataset (bydate version) was used as a benchmark reference point to ensure fair comparisons of methods. Qualitative analysis revealed that recent studies utilize Graph Neural Networks (GNNs) combined with models based on the transformer architecture and propose end-to-end solutions. Quantitative analysis demonstrated state-of-the-art results, with accuracy, micro F1-score, and macro F1-score of 90.38%, 88.28%, and 89.38%, respectively. However, the reproducibility of many studies may need to be improved for the benefit of the scientific community. The resulting overview covers a wide range of document classification methods and can contribute to a better understanding of this field. Additionally, the systematic review approach reduces systematic error, making it useful for researchers in the document classification community.

Readers can find more information in the English version of this post or directly in the article.

Leveraging spiking neural networks for topic modeling

This article investigates the application of spiking neural networks (SNNs) to the problem of topic modeling (TM): the identification of significant groups of words that represent human-understandable topics in large sets of documents. Our research is based on the hypothesis that an SNN that implements the Hebbian learning paradigm is capable of becoming specialized in the detection of statistically significant word patterns in the presence of adequately tailored sequential input. To support this hypothesis, we propose a novel spiking topic model (STM) that transforms text into a sequence of spikes and uses that sequence to train single-layer SNNs. In STM, each SNN neuron represents one topic, and each of the neuron’s weights corresponds to one word. STM synaptic connections are modified according to spike-timing-dependent plasticity; after training, the neurons’ strongest weights are interpreted as the words that represent topics. We compare the performance of STM with four other TM methods, namely Latent Dirichlet Allocation (LDA), Biterm Topic Model (BTM), Embedding Topic Model (ETM), and BERTopic, on three datasets: 20Newsgroups, BBC news, and AG news. The results demonstrate that STM can discover high-quality topics and successfully compete with comparative classical methods. This sheds new light on the possibility of adapting SNN models to unsupervised natural language processing.

Readers can find more information in the English version of this post or directly in the article.

Biologically Plausible Learning of Text Representation with Spiking Neural Networks

This study proposes a novel biologically plausible mechanism for generating low-dimensional spike-based text representation. First, we demonstrate how to transform documents into series of spikes (spike trains), which are subsequently used as input in the training process of a spiking neural network (SNN). The network is composed of biologically plausible elements and trained according to the unsupervised Hebbian learning rule, Spike-Timing-Dependent Plasticity (STDP). After training, the SNN can be used to generate low-dimensional spike-based text representation suitable for text/document classification. Empirical results demonstrate that the generated text representation may be effectively used in text classification, leading to an accuracy of 80.19% on the bydate version of the 20 newsgroups data set, which is a leading result amongst approaches that rely on low-dimensional text representations.
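For readers unfamiliar with STDP, the following minimal sketch shows a textbook pair-based form of the rule: a synapse is strengthened when the presynaptic spike precedes the postsynaptic one and weakened otherwise, with an exponentially decaying dependence on the time difference. This is a generic formulation and not necessarily the exact variant used in the paper; the function names and all parameter values (`a_plus`, `a_minus`, `tau_plus`, `tau_minus`, the weight bounds) are illustrative assumptions.

```python
import math


def stdp_delta(dt, a_plus=0.01, a_minus=0.012, tau_plus=20.0, tau_minus=20.0):
    """Pair-based STDP weight change for dt = t_post - t_pre (in ms).

    All amplitudes and time constants are assumed values for illustration.
    """
    if dt > 0:
        # Pre fires before post: potentiation.
        return a_plus * math.exp(-dt / tau_plus)
    if dt < 0:
        # Post fires before pre: depression.
        return -a_minus * math.exp(dt / tau_minus)
    return 0.0


def apply_stdp(weight, pre_spikes, post_spikes, w_min=0.0, w_max=1.0):
    """Accumulate STDP updates over all pre/post spike pairs, then clip."""
    for t_pre in pre_spikes:
        for t_post in post_spikes:
            weight += stdp_delta(t_post - t_pre)
    return min(w_max, max(w_min, weight))


if __name__ == "__main__":
    # An input (word) spike at 10 ms followed by an output spike at 15 ms
    # strengthens the synapse; the reverse order weakens it.
    print(apply_stdp(0.5, pre_spikes=[10.0], post_spikes=[15.0]))
    print(apply_stdp(0.5, pre_spikes=[15.0], post_spikes=[10.0]))
```

Under such a rule, inputs that repeatedly fire just before a neuron spikes end up with the largest weights, which is the intuition behind reading a neuron's strongest weights as its learned pattern.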

Readers can find more information in the English version of this post or directly in the article.

Recognising innovative companies by using a diversified stacked generalisation method for website classification

In this paper, we propose a classification system which is able to decide whether a company is innovative or not, based only on its public website available on the internet. As innovativeness plays a crucial role in the development of myriad branches of the modern economy, an increasing number of entities are expending effort to be innovative. Thus, a new issue has appeared: how can we recognise them? Not only is grasping the idea of innovativeness challenging for humans, but it is also impossible for any known machine learning algorithm. Therefore, we propose a new indirect technique: a diversified stacked generalisation method, which is based on a combination of a multi-view approach and a genetic algorithm. The proposed approach achieves better performance than all the other classification methods considered, which include: (i) models trained on single datasets; and (ii) a simple voting method on these models. Furthermore, in this study, we check whether an unaligned feature space improves classification results. The proposed solution has been extensively evaluated on real data collected from companies’ websites. The experimental results verify that the proposed method improves the classification quality of websites which might represent innovative companies.

Readers can find more information in the English version of this post or directly in the article.

Information Extraction System for Transforming Unstructured Text Data in Fire Reports into Structured Forms: A Polish Case Study

In this paper, the author presents a novel information extraction system that analyses fire service reports. Although the reports contain valuable information concerning fire and rescue incidents, the narrative information in these reports has received little attention as a source of data. This is because of the challenges associated with processing these data and making sense of the contents through the use of machines. Therefore, a new issue has emerged: How can we bring to light valuable information from the narrative portions of reports that currently escape the attention of analysts? The idea of information extraction and the relevant system for analysing data that lies outside existing hierarchical coding schemes can be challenging for researchers and practitioners. Furthermore, comprehensive discussion and propositions of such systems in rescue service areas are insufficient. Therefore, the author comprehensively and systematically describes the ways in which information extraction systems transform unstructured text data from fire reports into structured forms. Each step of the process has been verified and evaluated on real cases, including data collected from the Polish Fire Service. The realisation of the system has illustrated that we must analyse not only text data from the reports but also consider the data acquisition process. Consequently, we can create suitable analytical requirements. Moreover, the quantitative analysis and experimental results verify that we can (1) obtain good results of the text segmentation (F-measure 95.5%) and classification processes (F-measure 90%) and (2) implement the information extraction process and perform useful analysis.

Readers can find more information in the English version of this post or directly in the article.

Empirical evaluation of feature projection algorithms for multi-view text classification

This study aims to propose (i) a multi-view text classification method and (ii) a ranking method that allows for selecting the best information fusion layer among many variations. Multi-view document classification is worth a detailed study as it makes it possible to combine different feature sets into yet another view that further improves text classification. For this purpose, we propose a multi-view framework for text classification that is composed of two levels of information fusion. At the first level, classifiers are constructed using different data views, i.e. different vector space models, by various machine learning algorithms. At the second level, the information fusion layer processes its input with a feature projection method and a meta-classifier modelled by a selected machine learning algorithm. The final decision is based on the classification results produced by the first-level models. Moreover, we propose a ranking method to assess various configurations of the fusion layer. We use heuristics that utilise statistical properties of F-score values calculated for classification results produced at the fusion layer. The information fusion layer of the classification framework and the ranking method have been empirically evaluated. For this purpose, we introduce a use case checking whether companies’ domains identify their innovativeness. The results empirically demonstrate that the information fusion layer enhances classification quality. Friedman’s aligned rank test, the Wilcoxon signed-rank test, and the effect size support this hypothesis. In addition, the Spearman statistical test carried out for the obtained results demonstrated that the assessment made by the proposed ranking method converges to a well-established method named Hellinger – The Technique for Order Preference by Similarity to Ideal Solution (H-TOPSIS). Thus, the proposed approach may be used for the assessment of classifier performance.
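The two-level idea described above can be sketched in a few lines. The toy example below trains one first-level classifier per view and then fits a meta-classifier on the concatenated first-level scores. It is only a sketch under simplifying assumptions: the nearest-centroid classifier stands in for the "various machine learning algorithms", the feature projection step of the fusion layer is omitted, and the meta-classifier is fitted on in-sample scores rather than the out-of-fold predictions a careful stacking setup would use. All names (`NearestCentroid`, `fit_two_level`, `predict_two_level`) and the data are hypothetical.

```python
def centroid(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


class NearestCentroid:
    """Tiny stand-in classifier: scores a class by negative squared distance."""

    def fit(self, X, y):
        self.labels = sorted(set(y))
        self.centroids = {
            c: centroid([x for x, t in zip(X, y) if t == c]) for c in self.labels
        }
        return self

    def scores(self, x):
        return [-sq_dist(x, self.centroids[c]) for c in self.labels]

    def predict(self, x):
        s = self.scores(x)
        return self.labels[s.index(max(s))]


def fit_two_level(views, y):
    """views: one feature matrix per view, with rows aligned to the labels y."""
    # Level 1: one classifier per view.
    level1 = [NearestCentroid().fit(X, y) for X in views]
    # Level 2 input: concatenated first-level scores for each sample.
    fused = [
        sum((m.scores(X[i]) for m, X in zip(level1, views)), [])
        for i in range(len(y))
    ]
    meta = NearestCentroid().fit(fused, y)
    return level1, meta


def predict_two_level(level1, meta, sample_views):
    """sample_views: the same sample represented once per view."""
    fused = sum((m.scores(x) for m, x in zip(level1, sample_views)), [])
    return meta.predict(fused)


if __name__ == "__main__":
    # Two made-up views of four documents.
    view_a = [[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.9, 1.1]]
    view_b = [[5.0], [5.2], [1.0], [0.8]]
    labels = ["neg", "neg", "pos", "pos"]
    level1, meta = fit_two_level([view_a, view_b], labels)
    print(predict_two_level(level1, meta, [[0.1, 0.1], [4.9]]))
```

The design point is that the meta-classifier never sees raw features, only the first-level outputs, so views with different dimensionality or feature types can be combined without aligning their feature spaces.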

Readers can find more information in the English version of this post or directly in the article.

A recent overview of the state-of-the-art elements of text classification

The aim of this study is to provide an overview of the state-of-the-art elements of text classification. For this purpose, we first select and investigate the primary and recent studies and objectives in this field. Next, we examine the state-of-the-art elements of text classification. In the following steps, we qualitatively and quantitatively analyse the related works. Herein, we describe six baseline elements of text classification: data collection, data analysis for labelling, feature construction and weighting, feature selection and projection, training of a classification model, and solution evaluation. This study will help readers acquire the necessary information about these elements and their associated techniques. Thus, we believe that this study will assist other researchers and professionals in proposing new studies in the field of text classification.

Readers can find more information in the English version of this post or directly in the article.

The BigGrams: the semi-supervised information extraction system from HTML: an improvement in the wrapper induction

The article The BigGrams: the semi-supervised information extraction system from HTML: an improvement in the wrapper induction has been published in Knowledge and Information Systems. Quoting the abstract: The aim of this study is to propose an information extraction system, called BigGrams, which is able to retrieve relevant and structural information (relevant phrases, keywords) from semi-structural web pages, i.e. HTML documents. For this purpose, a novel semi-supervised wrappers induction algorithm has been developed and embedded in the BigGrams system. The wrappers induction algorithm utilizes a formal concept analysis to induce information extraction patterns. Also, in this article, the author (1) presents the impact of the configuration of the information extraction system components on information extraction results and (2) tests the boosting mode of this system. Based on empirical research, the author established that the proposed taxonomy of seeds and the HTML tags level analysis, with appropriate pre-processing, improve information extraction results. Also, the boosting mode works well when certain requirements are met, i.e. when well-diversified input data are ensured.

Readers can find more information in the English version of this post or directly in the article.

Od Informacji do Wiedzy (From Information to Knowledge)

A blog about information on information and knowledge
