What are the methods of Text Retrieval?


Text retrieval is the process of finding, within a collection of unstructured text documents, those that are relevant to a user’s query. It is closely related to text mining, which transforms unstructured text into a structured format to identify meaningful patterns and new insights, using analytical techniques such as Naïve Bayes, Support Vector Machines (SVM), and deep learning algorithms. There are two main methods of text retrieval, which are as follows −

Document Selection − In document selection methods, the query is regarded as defining constraints for choosing relevant documents. A typical approach in this category is the Boolean retrieval model, in which a document is represented by a set of keywords and a user provides a Boolean expression of keywords, such as "car and repair shops", "tea or coffee", or "database systems but not Oracle".

The retrieval system takes such a Boolean query and returns the documents that satisfy the Boolean expression. Because it is difficult to specify a user’s information need exactly with a Boolean query, Boolean retrieval usually works well only when the user knows a lot about the document collection and can formulate a good query in this way.
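The Boolean model described above can be sketched with ordinary set operations. This is a minimal illustration, not a full query parser: the toy collection `DOCS`, the helper names `build_index` and `term`, and the reading of the example queries as AND, OR, and AND NOT over single keywords are all assumptions made for the example.

```python
# Hypothetical toy collection: each document is represented by a set of keywords.
DOCS = {
    "d1": {"car", "repair", "shops"},
    "d2": {"tea", "house"},
    "d3": {"database", "systems", "oracle"},
    "d4": {"database", "systems", "postgresql"},
    "d5": {"coffee", "beans"},
}


def build_index(docs):
    """Map each keyword to the set of document ids that contain it."""
    index = {}
    for doc_id, keywords in docs.items():
        for keyword in keywords:
            index.setdefault(keyword, set()).add(doc_id)
    return index


INDEX = build_index(DOCS)


def term(word):
    """Return the ids of all documents containing the keyword (empty set if none)."""
    return INDEX.get(word, set())


# AND is set intersection, OR is set union, NOT is set difference.
car_and_repair = term("car") & term("repair")
tea_or_coffee = term("tea") | term("coffee")
db_not_oracle = (term("database") & term("systems")) - term("oracle")

print(sorted(car_and_repair))
print(sorted(tea_or_coffee))
print(sorted(db_not_oracle))
```

Note that a document is either returned or not; nothing in this model says which of the matching documents is the better match.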

Document Ranking − Document ranking methods use the query to rank all documents in order of relevance. For ordinary users and exploratory queries, these methods are more suitable than document selection methods. Most current information retrieval systems present a ranked list of documents in response to a user’s keyword query.

There are several ranking methods based on a broad spectrum of mathematical foundations, such as algebra, logic, probability, and statistics. The common intuition behind all of these methods is to match the keywords in a query with those in the documents and score each document depending on how well it matches the query.

The objective is to approximate the degree of relevance of a document with a score computed from information such as the frequency of words in the document and in the whole collection. It is inherently difficult to provide a precise measure of the degree of relevance between sets of keywords. For example, it is difficult to quantify the distance between "data mining" and "data analysis".
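One common way to turn word frequencies in a document and in the whole collection into a score is TF-IDF weighting. The text above does not name a specific formula, so the choice of TF-IDF here is an assumption, as are the toy collection `DOCS`, the whitespace tokenizer, and the function name `tf_idf_scores`. The sketch sums, over the query terms, the term's count in the document multiplied by the logarithm of its inverse document frequency.

```python
import math
from collections import Counter

# Hypothetical toy collection used only for illustration.
DOCS = {
    "d1": "data mining finds patterns in large data sets",
    "d2": "data analysis summarizes data with statistics",
    "d3": "tea and coffee are popular drinks",
}


def tokenize(text):
    """Very simple tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def tf_idf_scores(query, docs):
    """Score each document as the sum of tf * idf over the query terms."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)  # term frequency within this document
        score = 0.0
        for word in tokenize(query):
            if word in tf:
                idf = math.log(n_docs / df[word])  # rarer terms weigh more
                score += tf[word] * idf
        scores[doc_id] = score
    return scores


scores = tf_idf_scores("data mining", DOCS)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In this toy collection "mining" occurs in only one document, so it contributes more to the score than "data", which occurs in two.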

The most popular approach of this kind is the vector space model. The basic idea is the following: represent a document and a query both as vectors in a high-dimensional space whose dimensions correspond to all the keywords, and use an appropriate similarity measure to compute the similarity between the query vector and the document vector. The similarity values can then be used for ranking documents.
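The vector space idea can be sketched as follows. The text leaves the similarity measure open, so the use of cosine similarity is an assumption, as are the raw term-count vectors, the toy collection `DOCS`, and the helper names `to_vector` and `cosine_similarity`.

```python
import math
from collections import Counter

# Hypothetical toy collection and query used only for illustration.
DOCS = {
    "d1": "database systems store structured data",
    "d2": "operating systems manage hardware",
    "d3": "tea or coffee",
}
QUERY = "database systems"


def to_vector(text, vocabulary):
    """Represent a text as term counts, one dimension per vocabulary keyword."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# The vector space has one dimension for every keyword seen in the documents or the query.
all_words = set(QUERY.lower().split())
for text in DOCS.values():
    all_words.update(text.lower().split())
vocabulary = sorted(all_words)

query_vec = to_vector(QUERY, vocabulary)
similarities = {
    doc_id: cosine_similarity(query_vec, to_vector(text, vocabulary))
    for doc_id, text in DOCS.items()
}

# Rank documents from most to least similar to the query.
ranked = sorted(similarities, key=similarities.get, reverse=True)
print(ranked)
```

Unlike the Boolean model, every document receives a score, so a document that matches only part of the query still appears in the list, below documents that match more of it.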

Updated on 25-Nov-2021 09:55:26