What is Numerosity Reduction?


In numerosity reduction, the data volume is reduced by choosing an alternative, smaller form of data representation. These techniques may be parametric or nonparametric. In parametric methods, a model is used to estimate the data, so that only the model parameters need to be stored instead of the actual data; regression and log-linear models are examples. Nonparametric methods store a reduced representation of the data and include histograms, clustering, and sampling.

The main techniques of numerosity reduction are as follows −

Regression and Log-Linear Models − These models can be used to approximate the given data. In linear regression, the data are modeled to fit a straight line. For instance, a random variable y (known as the response variable) can be modeled as a linear function of another random variable x (known as the predictor variable) with the equation y = wx + b, where the variance of y is assumed to be constant. The coefficients w and b are the slope and y-intercept of the line, so only these two values need to be stored.
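As a minimal sketch of how a line y = wx + b can stand in for the raw data, the following Python snippet fits w and b by ordinary least squares. The function name fit_line and the sample values are made up for illustration, and least squares is an assumed (though common) way to choose the coefficients.

```python
def fit_line(xs, ys):
    """Return (w, b) for the least-squares line y = w*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w = sxy / sxx
    b = mean_y - w * mean_x
    return w, b


# Hypothetical data: five (x, y) observations.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 9.0, 10.8]

w, b = fit_line(xs, ys)

# Only w and b are stored; y is re-estimated on demand.
def predict(x):
    return w * x + b
```

Here the five y values are replaced by two numbers. The reduction pays off when the data set is large and the linear model is a reasonable fit.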

Log-linear models − These models are used to approximate discrete multidimensional probability distributions. Given a set of tuples in n dimensions (e.g., described by n attributes), we can consider each tuple as a point in an n-dimensional space.

Log-linear models can be used to estimate the probability of each point in a multidimensional space for a set of discretized attributes, based on a smaller subset of dimensional combinations. This enables a higher-dimensional data space to be constructed from lower-dimensional spaces.
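To illustrate the idea of building a higher-dimensional distribution from lower-dimensional ones, the sketch below uses the simplest log-linear model, the independence model, in which log P(a, b) = log P(a) + log P(b). The attribute values, function names, and data are hypothetical, and real log-linear models may also include interaction terms that this sketch omits.

```python
import math
from collections import Counter


def fit_independence_model(tuples):
    """Store only the log of each one-dimensional marginal probability."""
    n = len(tuples)
    counts_a = Counter(a for a, _ in tuples)
    counts_b = Counter(b for _, b in tuples)
    log_p_a = {a: math.log(c / n) for a, c in counts_a.items()}
    log_p_b = {b: math.log(c / n) for b, c in counts_b.items()}
    return log_p_a, log_p_b


def estimate_probability(log_p_a, log_p_b, a, b):
    """Estimate the joint probability of cell (a, b) from the marginals."""
    return math.exp(log_p_a[a] + log_p_b[b])


# Hypothetical tuples over two discretized attributes (age group, income level).
tuples = (
    [("young", "low")] * 3
    + [("young", "high")]
    + [("old", "low")]
    + [("old", "high")] * 3
)

log_p_a, log_p_b = fit_independence_model(tuples)
estimate = estimate_probability(log_p_a, log_p_b, "young", "low")
```

With this data the model estimates P(young, low) as 0.25, while the observed frequency is 3/8. The gap shows that the estimate is only as good as the independence assumption; in exchange, the model stores one value per attribute value rather than one per cell.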

Histograms − Histograms use binning to approximate data distributions and are a popular form of data reduction. A histogram for an attribute, A, partitions the data distribution of A into disjoint subsets, or buckets. If each bucket represents only a single attribute-value/frequency pair, the buckets are known as singleton buckets.
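The difference between singleton buckets and wider buckets can be sketched in a few lines of Python. The equal-width partitioning rule, the function names, and the price values below are assumptions chosen for illustration; the attribute is assumed to hold non-negative integers.

```python
from collections import Counter


def singleton_buckets(values):
    """One bucket per distinct value: value -> frequency."""
    return dict(sorted(Counter(values).items()))


def equal_width_histogram(values, width):
    """Group integer values into disjoint ranges of the given width."""
    counts = Counter((v // width) * width for v in values)
    # Each key is the inclusive (low, high) range covered by the bucket.
    return {(low, low + width - 1): c for low, c in sorted(counts.items())}


# Hypothetical values of attribute A (e.g., item prices in dollars).
prices = [1, 1, 5, 5, 5, 8, 10, 10, 14, 15, 15, 18, 20, 20, 21, 25]

singletons = singleton_buckets(prices)
buckets = equal_width_histogram(prices, 10)
```

For this data the singleton histogram needs ten value/frequency pairs, while the equal-width histogram with width 10 keeps only three buckets, at the cost of losing the distribution inside each range.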

Clustering − Clustering techniques consider data tuples as objects. They partition the objects into groups, or clusters, so that objects within a cluster are “similar” to one another and “dissimilar” to objects in other clusters. Similarity is commonly defined in terms of how “close” the objects are in space, based on a distance function.

The quality of a cluster can be measured by its diameter, the maximum distance between any two objects in the cluster. Centroid distance is an alternative measure of cluster quality and is defined as the average distance of each cluster object from the cluster centroid, which denotes the “average object,” or average point in space for the cluster.
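The two quality measures described above can be computed as in the following sketch. Euclidean distance is assumed as the distance function, and the function names and the four sample points are hypothetical.

```python
import math
from itertools import combinations


def diameter(points):
    """Maximum distance between any two objects in the cluster."""
    return max(math.dist(p, q) for p, q in combinations(points, 2))


def centroid(points):
    """Coordinate-wise mean of the cluster's objects."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def centroid_distance(points):
    """Average distance of each object from the cluster centroid."""
    c = centroid(points)
    return sum(math.dist(p, c) for p in points) / len(points)


# Hypothetical cluster: the four corners of a 2 x 2 square.
cluster = [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0), (2.0, 2.0)]

cluster_diameter = diameter(cluster)
cluster_centroid = centroid(cluster)
cluster_centroid_distance = centroid_distance(cluster)
```

For data reduction, the cluster can then be represented by its centroid (here the point (1.0, 1.0)) together with such summary measures instead of the individual objects.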

Sampling − Sampling can be used as a data reduction approach because it enables a large data set to be represented by a much smaller random sample (or subset) of the data.
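A minimal sketch of this idea, assuming a simple random sample drawn without replacement, is shown below. The function name, the seed parameter, and the data are illustrative only; other sampling schemes would need different code.

```python
import random


def simple_random_sample(data, s, seed=None):
    """Draw s of the tuples in data, each equally likely, without replacement."""
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    return rng.sample(data, s)


# Hypothetical data set of 1,000 tuple identifiers.
data = list(range(1000))

# Keep only 10 of the 1,000 tuples.
sample = simple_random_sample(data, 10, seed=42)
```

The cost of drawing the sample depends on the sample size s rather than on how complex the data is, which is part of what makes sampling attractive for very large data sets.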

Updated on: 19-Nov-2021
