What are the techniques of Text Indexing?

Several indexing techniques are widely used for text retrieval, including inverted indices and signature files.

Inverted Index − An inverted index is an index structure that maintains two hash-indexed or B+-tree-indexed tables: document_table and term_table. document_table consists of a set of document records, each with two fields: doc_id and posting_list, where posting_list is a list of terms (or pointers to terms) that occur in the document, ordered according to some relevance measure.

term_table consists of a set of term records, each with two fields: term_id and posting_list, where posting_list is a list of identifiers of the documents in which the term occurs.

With this organization, it is easy to find all of the documents associated with a given set of terms, or all of the terms associated with a given set of documents. For example, to find all of the documents associated with a set of terms, we first look up the list of document identifiers in term_table for each term, and then intersect these lists to obtain the set of relevant documents.
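The lookup-and-intersect query can be illustrated with a minimal Python sketch. It assumes plain in-memory dictionaries instead of hash-indexed or B+-tree-indexed tables, whitespace tokenization, and alphabetical ordering as a stand-in for a relevance measure. The function names build_tables and documents_for_terms are hypothetical and chosen only for this example.

```python
from collections import defaultdict


def build_tables(docs):
    """Build document_table and term_table from a {doc_id: text} mapping."""
    document_table = {}             # doc_id -> list of terms in the document
    term_table = defaultdict(set)   # term   -> set of doc_ids containing it
    for doc_id, text in docs.items():
        terms = text.lower().split()  # assumption: whitespace tokenization
        # Assumption: alphabetical order stands in for a relevance measure.
        document_table[doc_id] = sorted(set(terms))
        for term in terms:
            term_table[term].add(doc_id)
    return document_table, dict(term_table)


def documents_for_terms(term_table, terms):
    """Intersect the posting lists of all query terms."""
    postings = [term_table.get(term, set()) for term in terms]
    if not postings:
        return set()
    return set.intersection(*postings)


docs = {
    1: "data mining finds patterns",
    2: "text mining indexes documents",
    3: "data structures for text indexes",
}
document_table, term_table = build_tables(docs)
result = documents_for_terms(term_table, ["text", "indexes"])
print(result)
```

Here the query for "text" and "indexes" returns documents 2 and 3, because both terms appear in each of them. A term that is absent from term_table yields an empty posting list, so the intersection is empty.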

Inverted indices are widely used in industry because they are simple to implement. However, the posting lists can be rather long, which makes the storage requirement quite large. Inverted indices also handle synonymy (where two different words have the same meaning) and polysemy (where a single word has several meanings) poorly.

Signature File − A signature file is a file that stores a signature for each record in the database. Each signature has a fixed size of b bits that represent terms. A simple encoding scheme works as follows. Each bit of a record's signature is initialized to 0.

A bit is set to 1 if the term it represents appears in the record. A signature S1 matches another signature S2 if each bit that is set in S2 is also set in S1. Because there are usually more terms than available bits, several terms may be mapped to the same bit.
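The encoding and matching rules can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the signature width B = 16 is arbitrary, and the term-to-bit mapping uses zlib.crc32 as a convenient deterministic hash, which the original description does not prescribe. The names term_bit, signature, and matches are hypothetical.

```python
import zlib

B = 16  # assumption: signature width in bits, chosen only for illustration


def term_bit(term):
    """Map a term to one of the B bit positions (assumed hash: CRC-32)."""
    return zlib.crc32(term.encode("utf-8")) % B


def signature(terms):
    """Start from all zeros and set the bit of every term that appears."""
    sig = 0
    for term in terms:
        sig |= 1 << term_bit(term)
    return sig


def matches(s1, s2):
    """S1 matches S2 if every bit that is set in S2 is also set in S1."""
    return (s1 & s2) == s2


record_sig = signature(["text", "index", "mining"])
query_sig = signature(["text", "mining"])
print(matches(record_sig, query_sig))
```

Because the query terms are a subset of the record's terms, every bit of the query signature is also set in the record signature, so the match succeeds regardless of which bit positions the hash happens to choose.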

Such many-to-one mappings make the search expensive, because a record whose signature matches the signature of a query does not necessarily contain the query's keywords. Each matching record has to be retrieved, parsed, stemmed, and checked. Improvements can be made by first performing frequency analysis, stemming, and stop-word filtering, and then using hashing and superimposed coding techniques to encode the list of terms into a bit representation.
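The cost of many-to-one mappings can be seen in a small Python sketch. It assumes a deliberately weak toy hash (term length modulo B) so that "text" and "data" collide on the same bit, and it replaces the retrieve, parse, and stem steps with a plain whitespace split. The function names and the sample records are hypothetical.

```python
B = 8  # assumption: toy signature width


def term_bit(term):
    # Deliberately weak toy hash so that collisions are easy to see:
    # "text" and "data" both have length 4 and share bit 4.
    return len(term) % B


def signature(terms):
    """Set one bit per term, starting from an all-zero signature."""
    sig = 0
    for term in terms:
        sig |= 1 << term_bit(term)
    return sig


def search(records, query_terms):
    """Filter by signature first, then verify each candidate's actual terms."""
    query_sig = signature(query_terms)
    candidates = [
        rid for rid, text in records.items()
        if (signature(text.split()) & query_sig) == query_sig
    ]
    # Verification step: check the real terms to discard false matches.
    confirmed = [
        rid for rid in candidates
        if set(query_terms) <= set(records[rid].split())
    ]
    return candidates, confirmed


records = {1: "text mining", 2: "data mining", 3: "text index"}
candidates, confirmed = search(records, ["text", "mining"])
print(candidates, confirmed)
```

Records 1 and 2 both pass the signature test, but only record 1 actually contains both query terms. Record 2 is a false match caused by the collision between "text" and "data", which is why the verification step cannot be skipped.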

Published on 25-Nov-2021 09:56:50