What are the components of focused web crawlers?


A focused web crawler consists of the following components −

Seed Detector − The role of the Seed Detector is to determine the seed URLs for the given keyword by fetching the first n URLs. The seed pages are identified and assigned a priority using the PageRank algorithm, the HITS algorithm, or a similar link-analysis algorithm.
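As a minimal sketch, assuming the scores have already been produced by a link-analysis step such as PageRank or HITS, seed selection might look like this in Java (the class name, the hard-coded scores, and the selectSeeds method are illustrative assumptions, not the article's design):

import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical seed detector: ranks candidate URLs by a precomputed
// link-analysis score (e.g., PageRank or HITS) and keeps the top n.
public class SeedDetector {

    public static List<String> selectSeeds(Map<String, Double> scoredUrls, int n) {
        return scoredUrls.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue(Comparator.reverseOrder()))
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // In a real system these scores would come from the ranking algorithm;
        // here they are hard-coded for illustration.
        Map<String, Double> candidates = Map.of(
                "http://example.com/a", 0.42,
                "http://example.com/b", 0.17,
                "http://example.com/c", 0.81);
        System.out.println(selectSeeds(candidates, 2)); // the two highest-scored seeds
    }
}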

Crawler Manager − The Crawler Manager is an essential component of the system, following the Hypertext Analyzer. This component downloads the documents from the web. URLs are retrieved from the URL repository and added to the buffer in the Crawler Manager.

The URL buffer is a priority queue. Depending on the size of the URL buffer, the Crawler Manager dynamically creates crawler instances, which download the documents.
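A minimal sketch of such a buffer, using Java's built-in PriorityQueue ordered by a relevance score (the ScoredUrl record and the priority values are assumptions for illustration):

import java.util.PriorityQueue;

// Hypothetical URL buffer: a priority queue ordered by a relevance score,
// so the most promising URLs are handed to crawlers first.
public class UrlBuffer {

    record ScoredUrl(String url, double priority) {}

    private final PriorityQueue<ScoredUrl> queue =
            new PriorityQueue<>((a, b) -> Double.compare(b.priority(), a.priority()));

    public synchronized void add(String url, double priority) {
        queue.add(new ScoredUrl(url, priority));
    }

    public synchronized ScoredUrl next() {
        return queue.poll(); // highest-priority URL, or null if empty
    }

    public static void main(String[] args) {
        UrlBuffer buffer = new UrlBuffer();
        buffer.add("http://example.com/low", 0.2);
        buffer.add("http://example.com/high", 0.9);
        System.out.println(buffer.next()); // the high-priority URL comes out first
    }
}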

For greater efficiency, the Crawler Manager can create a crawler pool. The manager is also responsible for limiting the speed of the crawlers and balancing the load between them, which it does by monitoring the crawlers.
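One simple way to sketch a crawler pool is a fixed-size thread pool, which also bounds the number of concurrent downloads; the pool size of 4 and the crawl placeholder below are assumptions, not the article's design:

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical crawler pool: a fixed-size thread pool bounds the number of
// concurrent crawlers, which is one simple way to limit crawl speed.
public class CrawlerPool {

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(4); // at most 4 crawlers at once

        List<String> urls = List.of("http://example.com/1", "http://example.com/2");
        for (String url : urls) {
            pool.submit(() -> crawl(url));
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }

    private static void crawl(String url) {
        // Placeholder for the real download-and-store logic.
        System.out.println(Thread.currentThread().getName() + " crawling " + url);
    }
}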

Crawler − The crawler is a multi-threaded Java program that downloads web pages and saves the documents in the document repository. Every crawler has its own queue, which holds the list of URLs to be crawled. The crawler retrieves URLs from this queue.

Different crawlers may issue requests to the same server. Sending many simultaneous requests to the same server would overload it, since the server would be busy completing the requests already issued by crawlers that are waiting for responses.

Access to each server is therefore synchronized. If a request for the URL has not been issued previously, the request is forwarded to the HTTP Protocol Module. This ensures that the crawlers do not overload any single server.
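A minimal sketch of this per-server synchronization, assuming one lock object per host and a shared set of already-requested URLs (the class and field names are hypothetical):

import java.net.URI;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical per-server politeness control: one lock object per host so
// that only one crawler talks to a given server at a time, plus a shared
// set of already-requested URLs so duplicate requests are not reissued.
public class ServerGate {

    private final Map<String, Object> hostLocks = new ConcurrentHashMap<>();
    private final Set<String> requested = ConcurrentHashMap.newKeySet();

    public void fetch(String url) {
        if (!requested.add(url)) {
            return; // this URL was already requested by some crawler
        }
        String host = URI.create(url).getHost();
        Object lock = hostLocks.computeIfAbsent(host, h -> new Object());
        synchronized (lock) {
            // Forward the request to the HTTP Protocol Module here; holding
            // the per-host lock keeps requests to one server serialized.
            System.out.println("fetching " + url);
        }
    }

    public static void main(String[] args) {
        ServerGate gate = new ServerGate();
        gate.fetch("http://example.com/page");
        gate.fetch("http://example.com/page"); // duplicate, silently skipped
    }
}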

Link Extractor − The Link Extractor extracts the links from the documents in the document repository. The component checks whether each URL is already among the fetched URLs. If it is not found, the surrounding text preceding and succeeding the hyperlink, and the heading or sub-heading under which the link appears, are extracted.
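A minimal sketch of link extraction, using a regular expression to pull the href target and anchor text out of stored HTML (a production extractor would use a real HTML parser; the pattern below is an illustrative simplification):

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical link extractor: pulls href targets and their anchor text
// out of stored HTML with a regular expression.
public class LinkExtractor {

    private static final Pattern ANCHOR =
            Pattern.compile("<a[^>]+href=\"([^\"]+)\"[^>]*>(.*?)</a>",
                            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    public static List<String[]> extract(String html) {
        List<String[]> links = new ArrayList<>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            links.add(new String[] { m.group(1), m.group(2) }); // {url, anchor text}
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<p>See <a href=\"http://example.com\">this example</a>.</p>";
        for (String[] link : extract(html)) {
            System.out.println(link[0] + " -> " + link[1]);
        }
    }
}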

Hypertext Analyzer − The Hypertext Analyzer receives the keywords from the Link Extractor and determines the relevance of the terms to the search keyword using the Taxonomy Hierarchy.
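A minimal sketch of a relevance check, assuming a flat keyword-overlap score in place of a full taxonomy lookup (the scoring formula here is an assumption for illustration only):

import java.util.Arrays;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical relevance check: scores extracted text by the fraction of
// search keywords it contains. A real focused crawler would consult the
// taxonomy hierarchy instead of this flat keyword overlap.
public class HypertextAnalyzer {

    public static double relevance(String text, Set<String> keywords) {
        Set<String> words = Arrays.stream(text.toLowerCase().split("\\W+"))
                .collect(Collectors.toSet());
        long hits = keywords.stream().filter(words::contains).count();
        return keywords.isEmpty() ? 0.0 : (double) hits / keywords.size();
    }

    public static void main(String[] args) {
        Set<String> keywords = Set.of("crawler", "focused");
        System.out.println(relevance("A focused crawler follows relevant links", keywords)); // 1.0
    }
}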

HTTP Protocol Module − The HTTP Protocol Module issues the request for the documents whose URLs have been received from the queue. Upon receiving a document, the URL of the downloaded document is stored in the fetched-URL set along with a timestamp, and the document is stored in the document repository.
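A minimal sketch of such a module, using Java's built-in java.net.http.HttpClient (Java 11+); the in-memory maps standing in for the fetched-URL set and the document repository are assumptions:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical HTTP Protocol Module: downloads a document with Java's
// built-in HttpClient, records the URL with a timestamp, and keeps the
// body in an in-memory "repository" (a real system would persist both).
public class HttpProtocolModule {

    private final HttpClient client = HttpClient.newHttpClient();
    private final Map<String, Instant> fetchedUrls = new ConcurrentHashMap<>();
    private final Map<String, String> documentRepository = new ConcurrentHashMap<>();

    public void download(String url) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        fetchedUrls.put(url, Instant.now());          // URL + timestamp
        documentRepository.put(url, response.body()); // document body
    }

    public static void main(String[] args) throws Exception {
        HttpProtocolModule module = new HttpProtocolModule();
        module.download("http://example.com");
        System.out.println(module.fetchedUrls);
    }
}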
