What are statistical measures in large databases?


Relational database systems support five built-in aggregate functions: count(), sum(), avg(), max(), and min(). These aggregate functions can be used as basic measures in the descriptive mining of multidimensional data. Two kinds of descriptive statistical measures, measures of central tendency and measures of data dispersion, can be used effectively in large multidimensional databases.
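As a rough illustration of the five aggregate functions, the sketch below runs them against an in-memory SQLite database through Python's standard `sqlite3` module. The `employee` table, its columns, and the salary values are hypothetical and exist only for this example.

```python
import sqlite3

# Hypothetical table and values, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?)",
    [("a", 30.0), ("b", 40.0), ("c", 50.0), ("d", 80.0)],
)

# All five built-in aggregates computed in a single query.
row = conn.execute(
    "SELECT COUNT(*), SUM(salary), AVG(salary), MAX(salary), MIN(salary) "
    "FROM employee"
).fetchone()
count, total, average, largest, smallest = row
conn.close()

print(count, total, average, largest, smallest)
```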

Measures of central tendency − The common measures of central tendency are the mean, median, mode, and mid-range (the average of the largest and smallest values).

Mean − The arithmetic average is computed by adding together all values and dividing the total by the number of values. It uses data from every single value. Let X1, X2, ..., XN be a set of N values or observations, such as salaries. The mean of this set of values is

$$\mathrm{X^\prime\:=\:\frac{\sum_{i=1}^N\:X_i}{N}\:=\:\frac{X_1+X_2+\dotsm+X_N}{N}}$$

This corresponds to the built-in aggregate function avg() supported by relational database systems. In many data cubes, sum and count are precomputed and stored, so deriving the average is straightforward.

$\mathrm{average\:=\:\frac{sum}{count}}$
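The relation average = sum / count also means the mean can be assembled from partial results without revisiting the raw data. The following is a minimal sketch, assuming each data partition has already stored its own (sum, count) pair; the function name `combine_mean` and the sample chunks are illustrative, not part of any database API.

```python
def combine_mean(partials):
    """Derive the overall mean from per-partition (sum, count) pairs."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    if count == 0:
        raise ValueError("cannot compute the mean of zero values")
    return total / count


# Hypothetical partitions of a salary column.
chunks = [[30, 40], [50, 80, 100]]

# Each partition keeps only its sum and count, as a data cube might.
partials = [(sum(chunk), len(chunk)) for chunk in chunks]

overall_mean = combine_mean(partials)
print(overall_mean)
```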

Median − The median is the middle value of the sorted data. There are two cases for computing it, depending on whether the number of values n is odd or even.

If x1, x2, ..., xn are arranged in ascending order and n is odd, the median is the

$$\mathrm{\left(\frac{n+1}{2}\right)^{th}\:value}$$

For example, 1, 4, 6, 7, 12, 14, 18

Median = 7

When n is even, the median is the average of the two middle values:

$$\mathrm{\frac{\left(\frac{n}{2}\right)^{th}value\:+\:\left(\frac{n}{2}\:+\:1\right)^{th} value}{2}}$$

For example, 1, 4, 6, 7, 8, 12, 14, 16.

$$\mathrm{Median\:=\:\frac{7+8}{2}\:=\:7.5}$$
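The two cases above can be sketched in a few lines of Python. This is an illustrative helper for data that fits in memory, not a method for very large databases; the function name `median` is chosen only for this example.

```python
def median(values):
    """Exact median: middle value if n is odd, mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("cannot compute the median of zero values")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the ((n + 1) / 2)-th value, which is index mid when counting from 0.
        return ordered[mid]
    # Even n: average of the (n / 2)-th and (n / 2 + 1)-th values.
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_result = median([1, 4, 6, 7, 12, 14, 18])
even_result = median([1, 4, 6, 7, 8, 12, 14, 16])
print(odd_result, even_result)
```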

The median is neither a distributive measure nor an algebraic measure; it is a holistic measure. Although it is not simple to compute the exact median in a huge database, an approximate median can be computed efficiently.
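One common way to approximate the median is to group the values into intervals, keep only the frequency of each interval, and interpolate inside the interval that contains the middle position. The sketch below assumes that approach; the bucket layout `(lower_bound, width, frequency)` and the sample frequencies are hypothetical, and the result is only an estimate because it treats values as evenly spread within each interval.

```python
def approximate_median(buckets):
    """Estimate the median from (lower_bound, width, frequency) buckets sorted by lower_bound."""
    n = sum(freq for _, _, freq in buckets)
    if n == 0:
        raise ValueError("cannot estimate the median of zero values")
    half = n / 2
    below = 0  # number of values in the buckets before the current one
    for lower, width, freq in buckets:
        if freq > 0 and below + freq >= half:
            # Linear interpolation inside the bucket holding the middle position.
            return lower + (half - below) / freq * width
        below += freq


# Hypothetical frequency counts for the intervals [0, 10), [10, 20), [20, 30).
buckets = [(0, 10, 3), (10, 10, 4), (20, 10, 3)]
estimate = approximate_median(buckets)
print(estimate)
```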

Mode − The mode is the most common value in a set of values. Distributions can be unimodal, bimodal, or multimodal. If the data is categorical (measured on the nominal scale), the mode is the only measure of central tendency that can be computed. The mode can also be computed for ordinal and higher-level data, but there it is usually less informative than the median or mean.
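Because a distribution can have more than one mode, a helper that returns every most-frequent value is often more useful than one that returns a single value. The following is a small sketch using `collections.Counter` from the standard library; the function name `modes` and the sample colors are illustrative.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs with the highest frequency."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]


# Categorical (nominal) data: only the mode is meaningful here.
colors = ["red", "blue", "red", "green", "blue"]
color_modes = modes(colors)
print(color_modes)
```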

Measuring the dispersion of data − The degree to which numerical data tends to spread is known as the dispersion, or variance, of the data. The most common measures of data dispersion are the range, the interquartile range, and the standard deviation.

Range − The range is defined as the difference between the largest value and the smallest value in the set of data.

$$\mathrm{Range\:=\:X_L-X_S}$$

Where

$\mathrm{X_L\:\rightarrow\:largest\:value}$

$\mathrm{X_S\:\rightarrow\:smallest\:value}$

Quartiles − The most commonly used percentiles other than the median are the quartiles. The first quartile, denoted Q1, is the 25th percentile; the third quartile, denoted Q3, is the 75th percentile. The quartiles, together with the median, give some indication of the center, spread, and shape of a distribution. The distance between the first and third quartiles is a simple measure of spread that gives the range covered by the middle half of the data. This is known as the interquartile range (IQR) and is defined as −

$$\mathrm{IQR\:=\:Q_{3}-Q_{1}}$$
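The range and the IQR can be sketched with Python's standard `statistics` module (Python 3.8 or later). Note that several conventions exist for locating quartiles, so different tools may report slightly different Q1 and Q3 values for the same data; the `inclusive` method used here is just one of them, and the sample values are illustrative.

```python
import statistics

data = [1, 4, 6, 7, 8, 12, 14, 16]

# Range: largest value minus smallest value.
value_range = max(data) - min(data)

# Quartiles: n=4 splits the data into four parts and returns Q1, Q2 (the median), Q3.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

# Interquartile range: the spread of the middle half of the data.
iqr = q3 - q1

print(value_range, q1, q2, q3, iqr)
```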

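Standard deviation − The variance is the average of the squared deviations from the mean. Because the deviations are squared, the unit of measure of the variance is squared as well. The standard deviation is the square root of the variance, which brings the measure back to the original unit of the data. For a set of N values it is

$$\mathrm{\sigma\:=\:\sqrt{\frac{\sum_{i=1}^N\:(X_i-X^\prime)^2}{N}}}$$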
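A minimal sketch of the variance and standard deviation follows, assuming the data is treated as a whole population (dividing by N rather than N - 1). The function name and sample values are illustrative; the standard library's `statistics.pvariance` and `statistics.pstdev` compute the same quantities.

```python
import math


def population_variance(values):
    """Average of the squared deviations from the mean."""
    n = len(values)
    if n == 0:
        raise ValueError("cannot compute the variance of zero values")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


data = [2, 4, 4, 4, 5, 5, 7, 9]

variance = population_variance(data)  # expressed in squared units
std_dev = math.sqrt(variance)         # back in the original units

print(variance, std_dev)
```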

raja
Updated on 15-Feb-2022 07:22:15
