In today’s digital age, we’re flooded with information, and scientific publications are no exception. These documents often contain text in multiple languages: think of a paper with an English abstract, Polish keywords, and a French introduction. Manually categorizing such documents is time-consuming and inefficient. That’s where machine learning comes in!
The goal of our research was to develop a system that could automatically classify these multilingual documents into relevant scientific domains, such as Applied Sciences, Health Sciences, or Natural Sciences.
Our Solution: A Three-Level Classification System
We designed a compound classification system with three main layers:
- Preprocessing Layer: This layer prepares the documents for analysis by transforming them into a format that machine learning models can understand. We used two popular methods: TF-IDF (for the traditional models) and Word2Vec (for the more advanced models). A short sketch of both representations follows this list.
- Classification Layer: Here, we trained multiple classifiers to analyze the documents. We experimented with two algorithms:
  - Multinomial Naive Bayes (MNB): a simple yet effective algorithm for text classification.
  - Long Short-Term Memory (LSTM): a more complex algorithm that takes the order of words in a sentence into account, which is useful for understanding context.
- Decision Layer: This final layer combines the outputs from the classifiers to make the final decision about which category a document belongs to. We used a logistic regression model to weigh the probabilities from each classifier and assign the most likely category. A combined sketch of the classification and decision layers also follows this list.
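To make the preprocessing layer concrete, here is a minimal sketch of the two representations, assuming scikit-learn for TF-IDF and gensim for Word2Vec. The toy corpus and the strategy of averaging word vectors into a single document vector are illustrative choices on our part, not necessarily the paper’s exact setup (a sequence model such as an LSTM would consume the per-token vectors in order rather than averaging them).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models import Word2Vec

docs = [
    "machine learning for text classification",
    "classification of multilingual scientific documents",
]

# TF-IDF: one sparse vector per document, suited to models like MNB.
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(docs)  # shape: (n_docs, vocab_size)

# Word2Vec: train word embeddings, then average them per document to get
# one dense vector per document (a common, deliberately simple strategy).
tokenized = [doc.split() for doc in docs]
w2v = Word2Vec(sentences=tokenized, vector_size=100, window=5, min_count=1)

def doc_vector(tokens, model):
    """Mean of the word vectors for all in-vocabulary tokens."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.wv.vector_size)

dense_matrix = np.vstack([doc_vector(t, w2v) for t in tokenized])  # (n_docs, 100)
print(tfidf_matrix.shape, dense_matrix.shape)
```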
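And here is a hedged end-to-end sketch of the classification and decision layers: an MNB model and a small Keras LSTM each produce class probabilities, and a logistic-regression meta-classifier combines them. The toy data, network sizes, and training settings are our own illustrative assumptions, not the configuration used in the paper.

```python
import numpy as np
import tensorflow as tf
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Toy training data (illustrative only).
texts = np.array([
    "deep learning for medical image segmentation",
    "clinical trial of a new antiviral drug",
    "quantum entanglement in photonic systems",
    "bridge design under seismic load",
])
labels = np.array([0, 1, 2, 0])  # e.g. 0=Applied, 1=Health, 2=Natural Sciences

# Classification layer, model 1: MNB on TF-IDF features.
tfidf = TfidfVectorizer()
mnb = MultinomialNB().fit(tfidf.fit_transform(texts), labels)
mnb_proba = mnb.predict_proba(tfidf.transform(texts))

# Classification layer, model 2: a small LSTM over word sequences.
vectorize = tf.keras.layers.TextVectorization(max_tokens=5000,
                                              output_sequence_length=32)
vectorize.adapt(texts)
lstm = tf.keras.Sequential([
    vectorize,
    tf.keras.layers.Embedding(input_dim=5000, output_dim=32),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(3, activation="softmax"),
])
lstm.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
lstm.fit(texts, labels, epochs=5, verbose=0)
lstm_proba = lstm.predict(texts, verbose=0)

# Decision layer: logistic regression weighs the stacked probabilities.
# (In practice the meta-classifier should be fit on held-out predictions
# to avoid leakage; we reuse the training data here only for brevity.)
meta_features = np.hstack([mnb_proba, lstm_proba])
decision = LogisticRegression(max_iter=1000).fit(meta_features, labels)
print(decision.predict(meta_features))
```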
Key Findings: What Worked Best?
We ran several experiments to test different approaches, and here’s what we discovered:
- Breaking Documents into Parts Works Better: Instead of treating a document as a single block of text, we split it into parts (titles, abstracts, and keywords) and trained a separate model for each part, which improved classification accuracy. Keywords, for example, were particularly useful in determining a document’s category. A sketch of this per-part setup follows this list.
- Multilingual Models Outperform Monolingual Ones: When dealing with multilingual documents, it’s better to use models that can handle multiple languages at once, rather than training a separate model for each language. This approach led to higher accuracy in our experiments.
- Simple Algorithms Can Be Just as Effective: While the more advanced LSTM algorithm performed slightly better in some cases, the simpler Multinomial Naive Bayes algorithm was almost as good, and much faster to train. So, unless you need the extra boost in accuracy, MNB is a great choice.
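As a hedged illustration of the per-part setup (not the paper’s exact code), the sketch below trains one TF-IDF + MNB model per document part and lets a logistic-regression decision layer weigh the per-part probabilities. The field names and the mini-dataset are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny hypothetical dataset: each document already split into parts.
docs = [
    {"title": "LSTM models for text",
     "abstract": "We study recurrent networks for document classification.",
     "keywords": "lstm; nlp; classification"},
    {"title": "Antibiotic resistance trends",
     "abstract": "A survey of resistance in clinical isolates.",
     "keywords": "antibiotics; epidemiology"},
    {"title": "Photonic crystal lasers",
     "abstract": "We fabricate low-threshold photonic lasers.",
     "keywords": "photonics; lasers"},
]
labels = np.array([0, 1, 2])

fields = ["title", "abstract", "keywords"]
field_models, field_probas = {}, []
for field in fields:
    texts = [d[field] for d in docs]
    # One TF-IDF + MNB model per document part.
    model = make_pipeline(TfidfVectorizer(), MultinomialNB()).fit(texts, labels)
    field_models[field] = model
    field_probas.append(model.predict_proba(texts))

# Decision layer: weigh the per-part probabilities jointly.
meta = LogisticRegression(max_iter=1000).fit(np.hstack(field_probas), labels)
print(meta.predict(np.hstack(field_probas)))
```

Keeping the parts separate lets the decision layer learn, for instance, that the keywords model deserves extra weight, which is consistent with the finding above.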
Why Does This Matter?
Our system can help automate the organization of multilingual scientific documents, making it easier for researchers, librarians, and institutions to manage large collections of publications. This is especially useful in today’s globalized world, where scientific collaboration often spans multiple languages.
What’s Next?
While our system performed well, there’s still room for improvement. In the future, we plan to test it on documents that contain even more languages simultaneously. We also aim to make our dataset publicly available so others can build on our work.
Final Thoughts
Categorizing multilingual documents is a complex task, but with the right tools and approaches, it’s definitely achievable. Our compound classification system offers a promising solution, and we’re excited to see how it can be applied in real-world scenarios.
If you’re interested in the technical details, feel free to check out the full paper here. We have also prepared a summary in the form of a poster, available here.