Hey everyone! I’m excited to share some insights from my latest research paper, titled “A Diversified Classification Committee for Recognition of Innovative Internet Domains.” This work was published in the Communications in Computer and Information Science proceedings series, and I collaborated with Jaroslaw Protasiewicz from the National Information Processing Institute in Warsaw, Poland. The goal of our research was to develop a method for identifying innovative companies by analyzing their websites. Let me break it down for you in simple terms!
What’s the Big Idea?
Innovation is a key driver of economic growth, but identifying innovative companies isn’t always straightforward. We wanted to create a system that could automatically analyze a company’s website and determine whether it’s innovative or not. To do this, we used text mining and machine learning techniques to classify websites into two categories: innovative or not innovative.
How Did We Do It?
We built a classification committee made up of multiple Naive Bayes classifiers based on two different event models: the Bernoulli and multinomial distributions. The idea was to combine the strengths of these models to improve the accuracy of the overall classification system.
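To make this concrete, here is a minimal sketch of such a committee using scikit-learn. The toy data, vectorizer settings, and soft-voting rule are my own illustrative assumptions; the committee in the paper has more members and its own voting scheme.

```python
# A minimal sketch of a Naive Bayes committee, assuming scikit-learn.
# The toy data and the soft-voting rule are illustrative, not the
# exact configuration from the paper.
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data: website text labeled innovative (1) or not (0).
texts = [
    "patented sensor platform with novel machine learning research",
    "family restaurant serving pizza and pasta since 1985",
    "prototype robotics lab developing experimental drone autopilots",
    "discount shoe store with weekly clearance sales",
]
labels = [1, 0, 1, 0]

# One Bernoulli member (binary word presence) and one multinomial
# member (word counts), combined by averaging predicted probabilities.
committee = VotingClassifier(
    estimators=[
        ("bernoulli", make_pipeline(CountVectorizer(binary=True), BernoulliNB())),
        ("multinomial", make_pipeline(CountVectorizer(), MultinomialNB())),
    ],
    voting="soft",  # average class probabilities across members
)
committee.fit(texts, labels)
print(committee.predict(["new ai startup building novel prototypes"]))
```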
Here’s a quick overview of the steps we took:
- Data Collection: We crawled the web to collect data from company websites, including the main page text, links, and logos. (A minimal crawling sketch appears after this list.)
- Feature Extraction: We used text mining techniques to extract important features from the websites. This involved tokenizing text, identifying key phrases, and even recognizing named entities (like company names or product names).
- Information Retrieval: We used the Okapi BM25 ranking function (a scoring formula widely used in search engines) to find the documents on each website most likely to indicate innovativeness. (See the BM25 sketch after this list.)
- Classification: We trained our Naive Bayes classifiers on the extracted features and used committee voting to make the final decision on whether a company was innovative.
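For the data collection step, here is a minimal, hypothetical sketch assuming the requests and BeautifulSoup libraries; the paper does not specify which crawler it used, and the URL below is a placeholder.

```python
# A hypothetical crawling step: fetch a company's main page and pull
# out its visible text and outgoing links. The actual crawler also
# collected logos and crawled beyond the main page.
import requests
from bs4 import BeautifulSoup

def fetch_page(url):
    """Download one page and return its plain text and link targets."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    text = soup.get_text(separator=" ", strip=True)
    links = [a["href"] for a in soup.find_all("a", href=True)]
    return text, links

text, links = fetch_page("https://example.com")  # placeholder URL
print(text[:200], links[:5])
```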
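And here is a self-contained sketch of Okapi BM25 scoring. The query terms and the k1 and b parameters below are common defaults, not values taken from the paper.

```python
# Okapi BM25: score each document against a set of query terms.
# k1 and b are the usual default parameters, assumed for illustration.
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each query term appears.
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if df[term] == 0:
                continue
            # The "+1" inside the log keeps the idf non-negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            numerator = tf[term] * (k1 + 1)
            denominator = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * numerator / denominator
        scores.append(score)
    return scores

docs = [
    "our patented research prototype won an innovation award".split(),
    "opening hours monday to friday nine to five".split(),
]
print(bm25_scores(["innovation", "patented", "research"], docs))
```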
What Did We Find?
Our experiments showed that the diversified classification committee outperformed individual classifiers. Here are some key takeaways:
- Feature Selection Matters: We tested different methods for ranking the most important features, such as the Fisher score and the Chi-squared test. The Fisher ranking gave the best results. (A sketch of both rankings follows this list.)
- Committee Voting Works: By combining the results of multiple classifiers, we were able to achieve more stable and accurate predictions. The committee was especially effective when the number of features was between 600 and 4,000.
- Overfitting is a Challenge: As we increased the number of features, some classifiers started to overfit the data (i.e., they performed well on the training data but poorly on new data). However, the committee approach helped mitigate this issue.
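For the curious, here is a sketch of both ranking criteria on a toy count matrix, assuming scikit-learn for the chi-squared test. The Fisher score is written out by hand using the standard two-class formula; it is my own illustrative implementation, not code from the paper.

```python
# Rank candidate features with chi-squared (via scikit-learn) and
# with a hand-written two-class Fisher score, then keep the top k.
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

def fisher_scores(X, y):
    """Fisher score per feature: squared between-class mean difference
    divided by the summed within-class variances (two-class case)."""
    X0, X1 = X[y == 0], X[y == 1]
    mean_diff = (X0.mean(axis=0) - X1.mean(axis=0)) ** 2
    var_sum = X0.var(axis=0) + X1.var(axis=0)
    return mean_diff / (var_sum + 1e-12)  # guard against zero variance

# Toy count matrix: 4 documents x 5 candidate features.
X = np.array([[3, 0, 1, 0, 2],
              [0, 2, 0, 1, 0],
              [4, 0, 2, 0, 1],
              [0, 3, 0, 2, 0]])
y = np.array([1, 0, 1, 0])

# Keep the top-2 features under each criterion.
chi2_top = SelectKBest(chi2, k=2).fit(X, y)
fisher_top = np.argsort(fisher_scores(X, y))[::-1][:2]
print("chi2 keeps:", np.flatnonzero(chi2_top.get_support()))
print("fisher keeps:", fisher_top)
```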
Why Does This Matter?
Our system can be used to automatically identify innovative companies on the web, which could be incredibly useful for investors, researchers, and policy makers who want to support innovation. Plus, the methods we developed aren’t just limited to identifying innovative companies—they can be applied to other classification tasks as well.
What’s Next?
While our results are promising, there’s still room for improvement. In the future, we plan to explore adaptive feature selection for each classifier in the committee, which could further boost performance. We also want to experiment with iterative training using manually labeled data to improve accuracy.
Final Thoughts
This research was a challenging but rewarding journey, and I’m proud of what we’ve accomplished. If you’re interested in the technical details, you can check out the full paper here.