System and method for extracting and categorizing information from online sources
Assignee
6SENSE INSIGHTS, INC.
Inventors
Ernest Kirubakaran Selvaraj, Samira Golsefid, Viral Tarun Bajaria, Satish Arjun Chilloji, Akshay Rajendra Shah, Amresh Sekar, Shubham Kumar Sunwalka
Abstract
A system and method for efficiently extracting and categorizing business information from online sources are disclosed. The system comprises a web crawler that obtains company domains from a database and collects depth-1 URLs (pages linked directly from each company homepage). A classification model based on a fine-tuned BERT architecture predicts which of these URLs contain information relevant to tag generation. A content extractor then extracts content from the predicted URLs using one or more modules. Finally, a large language model (LLM) processes the extracted content and generates tags using custom prompts designed for each tag category. The prompts are tailored to the nature of the extracted content, which gives the LLM more context. This multi-stage approach addresses the challenges of processing large-scale, unstructured business data from diverse web sources, potentially offering improved efficiency, scalability, and accuracy in automated business intelligence gathering.
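The stages named in the abstract can be illustrated with a minimal Python sketch. It is not the disclosed implementation. Every name here (`depth1_urls`, `is_relevant`, `PROMPT_TEMPLATES`, `build_prompt`) is hypothetical, the same-domain filter is an assumption, the keyword check is only a placeholder for the fine-tuned BERT classifier, and the prompt wording and the tag categories "industry" and "products" are invented for illustration. Crawling over the network, content extraction, and the LLM call are omitted.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class _LinkParser(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def depth1_urls(homepage_url, homepage_html):
    """Return URLs linked directly from the homepage (depth 1).

    Assumption: only same-domain links are kept, fragments are dropped,
    and duplicates are removed while preserving order.
    """
    parser = _LinkParser()
    parser.feed(homepage_html)
    domain = urlparse(homepage_url).netloc
    seen, result = set(), []
    for href in parser.hrefs:
        absolute = urljoin(homepage_url, href).split("#")[0]
        if urlparse(absolute).netloc != domain:
            continue
        if absolute == homepage_url or absolute in seen:
            continue
        seen.add(absolute)
        result.append(absolute)
    return result


# Placeholder for the classification stage. The abstract describes a
# fine-tuned BERT model; this keyword heuristic is NOT that model.
RELEVANT_HINTS = ("about", "product", "solution", "service")


def is_relevant(url):
    path = urlparse(url).path.lower()
    return any(hint in path for hint in RELEVANT_HINTS)


# Hypothetical per-category prompt templates (wording is illustrative).
PROMPT_TEMPLATES = {
    "industry": "Using the page content below, name the company's industry.\n\n{content}",
    "products": "Using the page content below, list the company's products.\n\n{content}",
}


def build_prompt(category, content):
    """Fill the template for one tag category with extracted page content."""
    return PROMPT_TEMPLATES[category].format(content=content)
```

In use, the URLs returned by `depth1_urls` would be filtered by the classifier, the content of each surviving page would be extracted, and `build_prompt` would be called once per tag category before the result is sent to an LLM.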
CPC Classifications
Filing Date
2025-01-15
Application No.
19022441
Claims
10