Recognising innovative companies by using a diversified stacked generalisation method for website classification

In today’s fast-paced economy, innovation is the cornerstone of success for companies across various industries. From healthcare to agriculture, security to defense, and beyond, businesses are constantly striving to develop groundbreaking products and services. But how can we identify these innovative companies, especially when the only information available is their public website? This is the challenge we set out to address in our recent study, “Recognising innovative companies by using a diversified stacked generalisation method for website classification,” published in Applied Intelligence.

The Challenge of Identifying Innovativeness

Innovation is a complex and multifaceted concept. While some companies showcase their innovative products or services on their websites, others may not have visible innovations but are still working on groundbreaking projects. Traditional methods of assessing innovativeness, such as surveys or questionnaires, require direct interaction with the companies, which can be time-consuming and impractical for large-scale analysis.

To overcome this, we developed an automated system that uses machine learning to classify companies as innovative or non-innovative based solely on their websites. This approach is not only efficient but also scalable, allowing us to analyze thousands of websites in a fraction of the time it would take using traditional methods.

Our Approach: Diversified Stacked Generalisation

Our proposed method, diversified stacked generalisation, combines several views of a company’s website with a genetic algorithm to improve classification accuracy. Here’s how it works:

- Multi-View Data Analysis: We analyze a company’s website from three different perspectives:
  - Company Description: The text on the website’s main page, which often contains descriptions of products, services, and key personnel.
  - Link Labels: The labels of links on the website, which may indicate partnerships or collaborations with other companies.
  - Big Document: A comprehensive text representation created by combining the most relevant documents from the website using the Okapi BM25 search system.
- Classification Models: For each data source, we train a Naive Bayes (NB) classifier. These classifiers independently assess the likelihood that a company is innovative based on the specific data they analyze.
- Meta-Classifier: The outputs from the three NB classifiers are then fed into a meta-classifier, which makes the final decision. We tested several meta-classification methods, including Decision Trees (DT), Support Vector Machines (SVM), and k-Nearest Neighbors (k-NN), and found that DT and SVM performed particularly well.
- Genetic Algorithm for Feature Selection: To further enhance the model’s performance, we used a genetic algorithm to optimize the feature space for each classifier. This allowed us to select the most relevant features for each data source, improving the overall classification accuracy.
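
To make the stacking idea concrete, here is a minimal sketch in plain Python. It is a toy illustration under stated assumptions, not the code used in the study: the six example "companies", their tokens, the class and function names, and the choice of a 1-nearest-neighbour meta-classifier are all made up for the example (k-NN is simply the easiest of the meta-classifiers mentioned above to write by hand). The meta-features are built leave-one-out, so the meta-classifier never sees a base prediction for a sample the base model was trained on.

```python
import math
from collections import Counter

VIEWS = ("description", "link_labels", "big_document")

class NaiveBayes:
    """Multinomial Naive Bayes with Laplace smoothing over token lists."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.prior = Counter(labels)
        self.counts = {c: Counter() for c in self.classes}
        for tokens, label in zip(docs, labels):
            self.counts[label].update(tokens)
        self.vocab = {t for c in self.classes for t in self.counts[c]}
        return self

    def prob_innovative(self, tokens):
        """Posterior probability of label 1 (innovative) for one token list."""
        n_docs = sum(self.prior.values())
        log_scores = {}
        for c in self.classes:
            denom = sum(self.counts[c].values()) + len(self.vocab)
            score = math.log(self.prior[c] / n_docs)
            for t in tokens:
                if t in self.vocab:  # tokens never seen in training are ignored
                    score += math.log((self.counts[c][t] + 1) / denom)
            log_scores[c] = score
        top = max(log_scores.values())
        weights = {c: math.exp(s - top) for c, s in log_scores.items()}
        return weights.get(1, 0.0) / sum(weights.values())

def base_outputs(train, train_labels, sample):
    """One NB probability per view, from models trained on `train`."""
    return [
        NaiveBayes().fit([s[v] for s in train], train_labels).prob_innovative(sample[v])
        for v in VIEWS
    ]

def leave_one_out_meta_features(samples, labels):
    """Meta-features for each sample, predicted by models that did not see it."""
    return [
        base_outputs(samples[:i] + samples[i + 1:], labels[:i] + labels[i + 1:], s)
        for i, s in enumerate(samples)
    ]

def predict(samples, labels, meta_X, new_sample):
    """1-NN meta-classifier over the three base probabilities."""
    point = base_outputs(samples, labels, new_sample)
    nearest = min(range(len(meta_X)), key=lambda i: math.dist(point, meta_X[i]))
    return labels[nearest]

def company(description, link_labels, big_document):
    return dict(zip(VIEWS, (description, link_labels, big_document)))

# Invented toy data: label 1 = innovative, 0 = non-innovative.
samples = [
    company(["patent", "prototype"], ["lab", "university"], ["algorithm", "sensor"]),
    company(["prototype", "research"], ["university", "grant"], ["sensor", "trial"]),
    company(["patent", "research"], ["lab", "grant"], ["algorithm", "trial"]),
    company(["retail", "discount"], ["cart", "checkout"], ["catalogue", "delivery"]),
    company(["discount", "shop"], ["checkout", "sale"], ["delivery", "price"]),
    company(["retail", "shop"], ["cart", "sale"], ["catalogue", "price"]),
]
labels = [1, 1, 1, 0, 0, 0]
meta_X = leave_one_out_meta_features(samples, labels)

clear_case = company(["patent", "prototype"], ["lab", "grant"], ["sensor", "trial"])
mixed_case = company(["patent", "prototype"], ["cart", "sale"], ["price", "delivery"])
print(predict(samples, labels, meta_X, clear_case))
print(predict(samples, labels, meta_X, mixed_case))
```

The second example is the interesting one: its main-page text looks innovative, but the other two views disagree, and the meta-classifier weighs all three outputs instead of trusting a single view.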

Key Findings and Results

Our experiments demonstrated that the diversified stacked generalisation method outperforms traditional classification approaches, especially when the number of features is high. Here are some of the key findings:

- Improved Classification Accuracy: The meta-classifiers (DT, SVM, and k-NN) consistently outperformed the simple voting method, with improvements of up to 11% in F-score.
- Robustness to Overfitting: The system is robust to overfitting, especially when analyzing large documents. This is crucial for maintaining high accuracy when dealing with complex and diverse website content.
- Unaligned Feature Spaces: By using a genetic algorithm to create unaligned feature spaces (where each classifier uses a different number of features), we were able to further improve the classification results by 4.6% in F-score.
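
One way to picture the search for unaligned feature spaces is as a genetic algorithm over three numbers, one feature count per view. The sketch below is only an illustration of that idea: the chromosome encoding, the search bounds, the operators (tournament selection, uniform crossover, clamped mutation, elitism) and above all `toy_fitness` are assumptions made for the example. In a real setup the fitness would be the validation F-score of the stacked classifier trained with those feature counts; here it is a made-up function with an arbitrary peak so the code runs on its own.

```python
import random

VIEWS = ("description", "link_labels", "big_document")
MIN_FEATURES, MAX_FEATURES = 10, 1000  # assumed search bounds

def toy_fitness(counts):
    """Stand-in for a validation F-score; peaks at an arbitrary, invented point."""
    target = (200, 50, 800)
    return -sum((c - t) ** 2 for c, t in zip(counts, target))

def random_individual(rng):
    """A chromosome: one feature count per view."""
    return tuple(rng.randint(MIN_FEATURES, MAX_FEATURES) for _ in VIEWS)

def crossover(a, b, rng):
    """Uniform crossover: each gene is copied from one parent at random."""
    return tuple(rng.choice(pair) for pair in zip(a, b))

def mutate(individual, rng, rate=0.3, step=50):
    """Nudge each gene with probability `rate`, clamped to the bounds."""
    return tuple(
        min(MAX_FEATURES, max(MIN_FEATURES, gene + rng.randint(-step, step)))
        if rng.random() < rate else gene
        for gene in individual
    )

def tournament(population, fitness, rng, size=3):
    """Pick the fittest of `size` randomly drawn individuals."""
    return max(rng.sample(population, size), key=fitness)

def evolve(fitness, generations=40, pop_size=20, seed=0):
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(pop_size)]
    history = []  # best fitness seen in each generation
    for _ in range(generations):
        best = max(population, key=fitness)
        history.append(fitness(best))
        children = [best]  # elitism: the current best always survives
        while len(children) < pop_size:
            parent_a = tournament(population, fitness, rng)
            parent_b = tournament(population, fitness, rng)
            children.append(mutate(crossover(parent_a, parent_b, rng), rng))
        population = children
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history

best_counts, history = evolve(toy_fitness)
print(dict(zip(VIEWS, best_counts)))
```

Because the best individual is carried over unchanged, the best fitness can never drop from one generation to the next. The three counts are free to differ, which is what makes the feature spaces unaligned.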

Why This Matters

The ability to automatically identify innovative companies based on their websites has significant implications for various stakeholders:

- Investors: Quickly identify companies with high innovation potential for investment opportunities.
- Research Institutions: Find potential partners for collaborative research and development projects.
- Governments: Monitor and support innovative businesses to foster economic growth.

Looking Ahead

While our method has shown promising results, there is still room for improvement. In future work, we plan to explore multilingual text classification and expand the system’s capabilities to handle websites in different languages. Additionally, we aim to enhance the model’s robustness to fake innovativeness, such as the use of buzzwords or misleading advertising on websites.

Conclusion

Innovation is the driving force behind modern economic development, and our study provides a powerful tool for identifying innovative companies based on their online presence. By leveraging machine learning and advanced classification techniques, we can unlock new opportunities for collaboration, investment, and growth in the global economy.

If you’re interested in learning more about our research, you can read the full article in Applied Intelligence. We’re excited to continue exploring the potential of this approach and look forward to sharing more insights in the future.


Od Informacji do Wiedzy

A blog with information about information and knowledge