What is Web Mining?

Web mining can broadly be viewed as the application of adapted data mining methods to the web, whereas data mining is defined as the application of algorithms to find patterns in mostly structured data, embedded in a knowledge discovery process.

A distinctive property of web mining is that it must handle multiple data types. The web has several aspects that give rise to different approaches to mining: web pages contain text, web pages are connected via hyperlinks, and user activity can be monitored via web server logs.
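The three kinds of data named above (page text, hyperlinks, and server logs) can each be pulled out with a few lines of code. The following is a minimal sketch using only the Python standard library, not a full web mining pipeline. The sample page, the sample log line, and the names PageParser, parse_page, and parse_log_line are illustrative assumptions, and the log line is assumed to follow the Common Log Format.

```python
import re
from html.parser import HTMLParser


class PageParser(HTMLParser):
    """Collects text and hyperlink targets from one HTML page.

    Hypothetical helper for illustration; it does not filter out
    script or style content, which a real crawler would need to do.
    """

    def __init__(self):
        super().__init__()
        self.text_parts = []  # page text: input for content mining
        self.links = []       # hyperlinks: input for structure mining

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.text_parts.append(stripped)


# Assumed layout (Common Log Format):
# host ident authuser [date] "request" status bytes
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)


def parse_page(html):
    """Return (text, links) extracted from an HTML string."""
    parser = PageParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.links


def parse_log_line(line):
    """Return a dict of log fields (input for usage mining), or None."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


# Made-up sample data, for illustration only.
SAMPLE_HTML = (
    "<html><body><h1>Web Mining</h1>"
    '<p>See <a href="/logs.html">server logs</a>.</p></body></html>'
)
SAMPLE_LOG = (
    '192.0.2.1 - - [10/Oct/2000:13:55:36 -0700] '
    '"GET /logs.html HTTP/1.0" 200 2326'
)

text, links = parse_page(SAMPLE_HTML)
record = parse_log_line(SAMPLE_LOG)
print(text)
print(links)
print(record["host"], record["path"], record["status"])
```

Each output feeds a different kind of analysis: the text can go to text mining methods, the links form a graph between pages, and the log records describe which user requested which page and when.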

The Web also poses great challenges for effective resource and knowledge discovery, as the following observations show.

The Web seems to be too large for efficient data warehousing and data mining − The size of the Web is on the order of hundreds of terabytes and is still growing rapidly. Many organizations and societies place much of their publicly accessible data on the Web. It is barely feasible to set up a data warehouse that replicates, stores, or integrates all of the data on the Web.

The complexity of Web pages is far greater than that of any traditional text document collection − Web pages lack a unifying structure. They contain far more variation in authoring style and content than any set of books or other traditional text-based documents.

The Web is often regarded as a huge digital library; however, the tremendous number of documents in this library are not arranged in any particular sorted order. There is no index by category, nor by title, author, cover page, table of contents, and so on. It can be very challenging to search for the information you want in such a library.

The Web is a highly dynamic information source − Not only does the Web grow rapidly, but its information is also constantly updated. News, stock markets, weather, sports, shopping, company advertisements, and numerous other Web pages are updated regularly. Linkage information and access records are also updated frequently.

The Web serves a broad diversity of user communities − The Internet currently connects more than 100 million workstations, and its user community is still rapidly expanding. Users may have very different backgrounds, interests, and usage goals.

Some users may not have good knowledge of the structure of the information network and may not be aware of the heavy cost of a particular search. They can easily get lost by groping in the “darkness” of the network, or become bored by taking many access “hops” and waiting impatiently for a piece of information.