What is information retrieval?


Information retrieval (IR) is a field that has been developing in parallel with database systems for many years. Unlike the field of database systems, which has targeted query and transaction processing of structured data, information retrieval is concerned with the organization and retrieval of information from a large number of text-based documents.

Since information retrieval and database systems each handle different kinds of data, some database system problems are usually not present in information retrieval systems, such as concurrency control, recovery, transaction management, and updates. Conversely, some common information retrieval problems are usually not encountered in traditional database systems, such as unstructured documents, approximate search based on keywords, and the notion of relevance.

Because of the abundance of text data, information retrieval has found many applications. Several kinds of information retrieval systems exist, including online library catalog systems, online document management systems, and the more recently developed Web search engines.

A typical information retrieval problem is to locate relevant documents in a document collection based on a user’s query, which is often a few keywords describing an information need, although it can also be an example of a relevant document.

This is most suitable when a user has an ad hoc (i.e., short-term) information need, such as finding information about buying a used car. When a user has a long-term information need (e.g., a researcher’s interests), a retrieval system can also take the initiative to “push” any newly arrived document to the user if the document is judged as relevant to that information need.
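The ad hoc retrieval setting above can be sketched with a toy keyword matcher: a document is retrieved if it contains every keyword in the query. The documents and query here are made-up illustrations, not from any real system.

```python
# Toy keyword-based retrieval: return the IDs of documents that
# contain every keyword in the query (all names are illustrative).
documents = {
    "d1": "used car for sale in good condition",
    "d2": "new car dealership opening downtown",
    "d3": "tips for buying a used car",
}

def retrieve(query, docs):
    keywords = set(query.lower().split())
    # A document matches when the query keywords are a subset of its words.
    return [doc_id for doc_id, text in docs.items()
            if keywords <= set(text.lower().split())]

print(retrieve("used car", documents))  # → ['d1', 'd3']
```

Real IR systems refine this idea with stemming, stop-word removal, inverted indexes, and relevance ranking, but the basic match-against-keywords step is the same.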

There are two basic measures for assessing the quality of text retrieval, which are as follows −

Precision − This is the percentage of retrieved documents that are actually relevant to the query (i.e., “correct” responses). It is formally defined as

$$precision=\frac{|\left\{ Relevant \right\}\cap\left\{ Retrieved \right\}|}{|\left\{ Retrieved \right\}|}$$

Recall − This is the percentage of documents that are relevant to the query and were actually retrieved. It is formally defined as

$$recall=\frac{|\left\{ Relevant \right\}\cap\left\{ Retrieved \right\}|}{|\left\{ Relevant \right\}|}$$

An information retrieval system is often required to trade off recall for precision or vice versa. One commonly used trade-off measure is the F-score, defined as the harmonic mean of recall and precision −

$$F\_score=\frac{2 \times recall \times precision}{recall+precision}$$

The harmonic mean penalizes a system that sacrifices one measure for the other too heavily. Precision, recall, and F-score are the basic measures of a retrieved set of documents. These three measures are not directly useful for comparing two ranked lists of documents because they are not sensitive to the internal ranking of the documents within a retrieved set.
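The three measures follow directly from the set definitions above. The following sketch computes them for one query; the relevant and retrieved sets are invented for illustration.

```python
# Precision, recall, and F-score for a single query, computed from
# the set definitions in the text. The document IDs are made up.
relevant = {"d1", "d3", "d5", "d7"}    # {Relevant}: judged relevant
retrieved = {"d1", "d2", "d3", "d4"}   # {Retrieved}: returned by the system

hits = relevant & retrieved            # {Relevant} ∩ {Retrieved}

precision = len(hits) / len(retrieved) # 2 / 4 = 0.5
recall = len(hits) / len(relevant)     # 2 / 4 = 0.5

# Harmonic mean of recall and precision.
f_score = 2 * recall * precision / (recall + precision)

print(precision, recall, f_score)      # → 0.5 0.5 0.5
```

Because the harmonic mean is dominated by the smaller of the two values, a system that returns everything (recall 1.0, precision near 0) still gets an F-score near 0.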

Updated on 25-Nov-2021 09:52:56