R* Tree in Data Structure

Basic concept

In data processing, R*-trees are a variant of R-trees used for indexing spatial information.

R*-trees have a slightly higher construction cost than standard R-trees, because data may need to be reinserted, but the resulting tree generally has better query performance. Like the standard R-tree, the R*-tree can store both point and spatial data. It was proposed by Norbert Beckmann, Hans-Peter Kriegel, Ralf Schneider, and Bernhard Seeger in 1990.

Difference between R*-trees and R-trees

An R*-tree is built by repeated insertion. The resulting tree has little (i.e. almost no) overlap, which leads to good query performance. Minimizing both coverage and overlap is crucial to the performance of R-trees. Overlap means that, on data insertion or query, more than one branch of the tree has to be expanded, because the regions into which the data is divided may overlap. Minimized coverage improves pruning performance, so that whole pages can often be excluded from a search, in particular for negative range queries. The R*-tree attempts to reduce both by combining a revised node split algorithm with forced reinsertion at node overflow. This approach is based on the observation that R-tree structures are highly sensitive to the order in which their entries are inserted, so an insertion-built (rather than bulk-loaded) structure is likely to be sub-optimal. Deleting and reinserting entries allows them to "find" a place in the tree that may be more appropriate than their original location.
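To make the terms coverage and overlap concrete, the following is a minimal sketch for two-dimensional bounding rectangles. The Rect type and the helper names (area, enclose, overlap_area, intersects) are hypothetical and chosen for illustration only; they do not come from any particular R*-tree library.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """Axis-aligned bounding rectangle (hypothetical helper type)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def area(r: Rect) -> float:
    """Coverage of a single region: the area of its bounding rectangle."""
    return (r.xmax - r.xmin) * (r.ymax - r.ymin)


def enclose(a: Rect, b: Rect) -> Rect:
    """Smallest rectangle that contains both a and b."""
    return Rect(min(a.xmin, b.xmin), min(a.ymin, b.ymin),
                max(a.xmax, b.xmax), max(a.ymax, b.ymax))


def overlap_area(a: Rect, b: Rect) -> float:
    """Area shared by a and b, or 0.0 if they do not overlap."""
    w = min(a.xmax, b.xmax) - max(a.xmin, b.xmin)
    h = min(a.ymax, b.ymax) - max(a.ymin, b.ymin)
    return w * h if w > 0 and h > 0 else 0.0


def intersects(a: Rect, b: Rect) -> bool:
    """True if a and b share at least one point."""
    return (a.xmin <= b.xmax and b.xmin <= a.xmax and
            a.ymin <= b.ymax and b.ymin <= a.ymax)


# Two sibling regions that overlap in the square [1, 2] x [1, 2].
left = Rect(0, 0, 2, 2)
right = Rect(1, 1, 3, 3)

# A query that falls inside the overlapping part must descend into both branches.
query = Rect(1.2, 1.2, 1.8, 1.8)
branches_to_visit = [r for r in (left, right) if intersects(r, query)]
```

In this example the query lies entirely in the shared area, so both sibling regions have to be searched. With non-overlapping siblings, at most one branch would be expanded for such a small query.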

Algorithm and complexity

  • The R*-tree uses the same algorithm as the regular R-tree for query and delete operations.
  • On insertion, the R*-tree uses a combined strategy. For leaf nodes, overlap is minimized, while for inner nodes, enlargement and area are minimized.
  • On splitting, the R*-tree uses a topological split that chooses a split axis based on perimeter, then minimizes overlap.
  • In addition to the improved split strategy, the R*-tree also tries to avoid splits by reinserting objects and subtrees into the tree, inspired by the concept of balancing a B-tree.
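The split step described above can be sketched as follows for two-dimensional rectangles. This is an illustrative sketch under stated assumptions, not a reference implementation: rectangles are plain (xmin, ymin, xmax, ymax) tuples, min_fill is an assumed minimum number of entries per node, and all function names are hypothetical.

```python
def mbr(rects):
    """Minimum bounding rectangle of a list of (xmin, ymin, xmax, ymax) tuples."""
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))


def margin(r):
    """Perimeter of a rectangle."""
    return 2 * ((r[2] - r[0]) + (r[3] - r[1]))


def area(r):
    return (r[2] - r[0]) * (r[3] - r[1])


def overlap(a, b):
    """Area shared by a and b, or 0.0 if they do not overlap."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return w * h if w > 0 and h > 0 else 0.0


def candidate_splits(entries, axis, min_fill):
    """Yield every two-group distribution along one axis.

    Entries are sorted once by their lower and once by their upper bound
    on the axis; each group keeps at least min_fill entries.
    """
    for key_index in (axis, axis + 2):
        ordered = sorted(entries, key=lambda r: r[key_index])
        for k in range(min_fill, len(entries) - min_fill + 1):
            yield ordered[:k], ordered[k:]


def rstar_split(entries, min_fill):
    """Choose the split axis by perimeter, then the distribution by overlap."""
    # Step 1: pick the axis whose candidate splits have the smallest
    # total perimeter (sum over all distributions and both groups).
    best_axis = min((0, 1), key=lambda axis: sum(
        margin(mbr(g1)) + margin(mbr(g2))
        for g1, g2 in candidate_splits(entries, axis, min_fill)))
    # Step 2: on that axis, pick the distribution with the least overlap
    # between the two groups, breaking ties by smaller total area.
    return min(candidate_splits(entries, best_axis, min_fill),
               key=lambda s: (overlap(mbr(s[0]), mbr(s[1])),
                              area(mbr(s[0])) + area(mbr(s[1]))))


# An overflowing node with two clusters that are far apart along x.
entries = [(0, 0, 1, 1), (0, 2, 1, 3),
           (10, 0, 11, 1), (10, 2, 11, 3), (10, 4, 11, 5)]
group_a, group_b = rstar_split(entries, min_fill=2)
```

For this input the x axis yields the smaller perimeter sum, and the chosen distribution separates the two clusters with zero overlap. Each sort costs O(M log M) for M entries, which is where the split cost quoted for the R*-tree comes from.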

Worst-case query and delete complexity are thus the same as for the R-tree. The insertion strategy of the R*-tree, at O(M log M), is more complex than the linear split strategy (O(M)) of the R-tree but less complex than the quadratic split strategy (O(M^2)) for a page size of M objects, and it has little impact on the total complexity. The total insert complexity is still comparable to that of the R-tree: reinsertions affect at most one branch of the tree and thus cause O(log n) reinsertions, comparable to performing a split on a regular R-tree. Overall, the complexity of the R*-tree is therefore the same as that of a regular R-tree.
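The reinsertion step that these bounds refer to can be illustrated with a small sketch of how entries might be selected from an overflowing node. It assumes that the entries farthest from the centre of the node's bounding rectangle are removed first, and that roughly 30% of the entries are taken; both the fraction parameter and the function names are assumptions made for illustration. The removed entries would then go through the ordinary insertion procedure again.

```python
def mbr(rects):
    """Minimum bounding rectangle of a list of (xmin, ymin, xmax, ymax) tuples."""
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))


def center(r):
    """Centre point of a rectangle."""
    return ((r[0] + r[2]) / 2, (r[1] + r[3]) / 2)


def pick_for_reinsertion(entries, fraction=0.3):
    """Split an overflowing node into (entries to reinsert, entries to keep).

    fraction is an assumed tuning parameter; at least one entry is removed.
    """
    cx, cy = center(mbr(entries))

    def squared_distance(r):
        x, y = center(r)
        return (x - cx) ** 2 + (y - cy) ** 2

    # Sorting by distance from the node centre costs O(M log M) for M entries.
    ordered = sorted(entries, key=squared_distance, reverse=True)
    count = max(1, int(len(entries) * fraction))
    return ordered[:count], ordered[count:]


# One outlier at the origin stretches the node's bounding rectangle.
node = [(0, 0, 1, 1), (4, 4, 5, 5), (5, 5, 6, 6), (4, 5, 5, 6), (9, 9, 12, 12)]
to_reinsert, kept = pick_for_reinsertion(node)
```

Here the outlier is the entry selected for reinsertion, which shrinks the bounding rectangle of the remaining entries and therefore the node's coverage.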