What are statistical measures in large databases?

Relational database systems support five built-in aggregate functions: count(), sum(), avg(), max(), and min(). These aggregate functions can be used as basic measures in the descriptive mining of multidimensional data. Two kinds of descriptive statistical measures, measures of central tendency and measures of data dispersion, can be used effectively in large multidimensional databases.

Measures of central tendency − Measures of central tendency include the mean, median, mode, and mid-range.

Mean − The arithmetic mean is computed by adding all the values together and dividing by the number of values. It uses data from every single value. Let X₁, X₂, ..., X_N be a set of N values or observations, such as salaries. The mean of this set of values is

$$\mathrm{\bar{X}\:=\:\frac{\sum_{i=1}^N\:X_i}{N}\:=\:\frac{X_1+X_2+\dotsm+X_N}{N}}$$

This corresponds to the built-in aggregate function average (avg()) supported in relational database systems. In many data cubes, sum and count are precomputed and stored, so the average can be derived directly from them.

$\mathrm{average\:=\:\frac{sum}{count}}$
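The derivation of the average from stored sums and counts can be sketched in Python as follows. This is a minimal illustration, not part of any database API: the function name `combine_average` and the salary partitions are hypothetical, and each partition stands in for a precomputed cell that stores only its sum and count.

```python
from typing import Iterable, Tuple


def combine_average(partials: Iterable[Tuple[float, int]]) -> float:
    """Derive the overall average from per-partition (sum, count) pairs."""
    total_sum = 0.0
    total_count = 0
    for part_sum, part_count in partials:
        total_sum += part_sum
        total_count += part_count
    if total_count == 0:
        raise ValueError("cannot average zero values")
    return total_sum / total_count


# Hypothetical salary partitions (illustrative data only).
partitions = [[30, 40, 50], [60, 70], [80]]

# Only sum and count are kept per partition, as in a precomputed cube.
partials = [(sum(p), len(p)) for p in partitions]

overall_average = combine_average(partials)
print(overall_average)
```

Because only the sums and counts are combined, the raw values never need to be revisited, which is what makes the average cheap to derive from precomputed aggregates.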

Median − There are two cases for computing the median, depending on whether the number of values is odd or even.

If x₁, x₂, ..., x_n are arranged in ascending order and n is odd, the median is the

$$\mathrm{\left(\frac{n+1}{2}\right)^{th}\:value}$$

For example, 1, 4, 6, 7, 12, 14, 18

Median = 7

When n is even, the median is

$$\mathrm{\frac{\left(\frac{n}{2}\right)^{th}value\:+\:\left(\frac{n}{2}\:+\:1\right)^{th} value}{2}}$$

For example, 1, 4, 6, 7, 8, 12, 14, 16.

$$\mathrm{Median\:=\:\frac{7+8}{2}\:=\:7.5}$$
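Both cases can be checked with a short Python sketch. The function name `median` is chosen here for illustration; the input lists are the two examples above.

```python
def median(values):
    """Return the median of a non-empty collection of numbers."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the ((n + 1) / 2)-th value, which is index mid (0-based).
        return ordered[mid]
    # Even n: mean of the (n / 2)-th and (n / 2 + 1)-th values.
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_median = median([1, 4, 6, 7, 12, 14, 18])
even_median = median([1, 4, 6, 7, 8, 12, 14, 16])
print(odd_median, even_median)
```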

The median is neither a distributive measure nor an algebraic measure; it is a holistic measure. Although the exact median is not easy to compute in a huge database, an approximate median can be computed efficiently.
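The text does not say how the approximation is obtained. One common approach, assumed here, is to group the data into intervals, keep only the frequency of each interval, and interpolate linearly inside the interval that contains the median. The function name `approximate_median` and the bucket data below are hypothetical.

```python
def approximate_median(buckets):
    """Estimate the median from grouped data.

    buckets: list of (lower_bound, upper_bound, frequency) tuples,
    sorted by lower_bound and covering contiguous intervals.
    Assumes values are spread evenly inside each interval.
    """
    total = sum(freq for _, _, freq in buckets)
    if total == 0:
        raise ValueError("no observations")
    half = total / 2
    cumulative = 0  # number of observations below the current interval
    for lower, upper, freq in buckets:
        if freq > 0 and cumulative + freq >= half:
            # Linear interpolation inside the median interval.
            return lower + (half - cumulative) / freq * (upper - lower)
        cumulative += freq
    raise RuntimeError("median interval not found")


# Hypothetical frequency table: three intervals of width 10.
buckets = [(0, 10, 3), (10, 20, 4), (20, 30, 3)]
estimate = approximate_median(buckets)
print(estimate)
```

Only the interval frequencies are needed, so the estimate can be produced in one pass over counts that are easy to maintain, at the cost of accuracy inside the median interval.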

Mode − It is the most common value in a set of values. Distributions can be unimodal, bimodal, or multimodal. If the data is categorical (measured on the nominal scale), then only the mode can be computed. The mode can also be computed for ordinal and higher-level data, but it is generally not the most suitable measure for such data.
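A small Python sketch shows how the mode can be found for categorical data and how a bimodal set yields two modes. The helper name `modes` and the sample values are illustrative assumptions.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs with the highest frequency."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    # Sorting gives a stable, predictable order when there are ties.
    return sorted(v for v, c in counts.items() if c == highest)


# Nominal (categorical) data: only the mode is meaningful here.
colour_modes = modes(["red", "blue", "red", "green"])

# A bimodal set: two values share the highest frequency.
number_modes = modes([1, 2, 2, 3, 3])
print(colour_modes, number_modes)
```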

Measuring the dispersion of data − The degree to which numerical data tend to spread is known as the dispersion or variance of the data. The most common measures of data dispersion are the range, the interquartile range, and the standard deviation.

Range − The range is defined as the difference between the largest value and the smallest value in the set of data.

$$\mathrm{Range\:=\:X_L-X_S}$$

where

$\mathrm{X_L\:\rightarrow\:largest\:value}$

$\mathrm{X_S\:\rightarrow\:smallest\:value}$
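As a brief illustration, the range can be computed directly, or from per-partition minima and maxima, which correspond to the built-in min() and max() aggregates. The function names and the partitioning below are hypothetical.

```python
def value_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)


def combined_range(partials):
    """Range derived from per-partition (min, max) pairs."""
    smallest = min(low for low, _ in partials)
    largest = max(high for _, high in partials)
    return largest - smallest


data = [1, 4, 6, 7, 8, 12, 14, 16]
direct = value_range(data)

# Hypothetical split of the same data into two partitions.
partitions = [data[:4], data[4:]]
partials = [(min(p), max(p)) for p in partitions]
from_partials = combined_range(partials)
print(direct, from_partials)
```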

Quartiles − The most commonly used percentiles other than the median are the quartiles. The first quartile, denoted by Q₁, is the 25th percentile; the third quartile, denoted by Q₃, is the 75th percentile. The quartiles, including the median, give some indication of the center, spread, and shape of a distribution. The distance between the first and third quartiles is a simple measure of spread that gives the range covered by the middle half of the data. This is known as the interquartile range (IQR) and is defined as −

$$\mathrm{IQR\:=\:Q_{3}-Q_{1}}$$
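The IQR can be sketched with Python's standard `statistics.quantiles` function (available from Python 3.8). Several conventions exist for locating quartiles and the text does not name one, so the `"inclusive"` interpolation method used here is an assumption; other methods can give slightly different quartiles for the same data.

```python
import statistics

data = [1, 4, 6, 7, 8, 12, 14, 16]

# n=4 splits the data into quartiles and returns the three cut points.
# method="inclusive" is an assumed convention, not one the text specifies.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

iqr = q3 - q1
print(q1, q2, q3, iqr)
```

The middle cut point `q2` is the median, which matches the value 7.5 obtained for this data set earlier.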

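Standard deviation − Because the deviations from the mean are squared when the variance is computed, the variance is expressed in squared units. The standard deviation is the square root of the variance, so it is expressed in the same unit as the data.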
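The relationship between variance and standard deviation can be sketched as follows. The population form (dividing by N) is an assumption made for this example; the sample form divides by N − 1 instead. The function name and data are illustrative.

```python
import math


def variance_and_std(values):
    """Return (population variance, population standard deviation)."""
    n = len(values)
    if n == 0:
        raise ValueError("no observations")
    mean = sum(values) / n
    # Squared deviations: the result is in squared units of the data.
    variance = sum((x - mean) ** 2 for x in values) / n
    # The square root brings the measure back to the original unit.
    return variance, math.sqrt(variance)


variance, std_dev = variance_and_std([2, 4, 4, 4, 5, 5, 7, 9])
print(variance, std_dev)
```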

Ginni

Updated on: 15-Feb-2022
