What are Randomized Algorithms and Data Stream Management Systems in data mining?

Randomized Algorithms − Randomized algorithms, in the form of random sampling and sketching, are used to deal with large, high-dimensional data streams. The use of randomization often leads to simpler and more efficient algorithms than known deterministic algorithms.
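As an illustration of random sampling over a stream, the sketch below uses reservoir sampling, one well-known technique that is not named in the text above and is chosen here only as an example. The function name `reservoir_sample` and its parameters are hypothetical. It keeps a uniform random sample of `k` items while reading the stream once, without knowing the stream length in advance.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of up to k items from an iterable."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item number i+1 replaces a random slot with probability k/(i+1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir

# Illustrative usage on a synthetic stream of 100,000 integers.
sample = reservoir_sample(range(100_000), 10, random.Random(42))
```

Only `k` items are held in memory at any time, which is what makes this kind of sampling suitable for streams that are too large to store.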

If a randomized algorithm always returns the correct answer but its running time varies, it is called a Las Vegas algorithm. In contrast, a Monte Carlo algorithm has bounds on its running time but may not return the correct result. The discussion here mainly concerns Monte Carlo algorithms. One way to think of a randomized algorithm is simply as a probability distribution over a set of deterministic algorithms.
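The difference between the two kinds of algorithm can be sketched with a toy search problem: find the index of a target value by probing random positions. The function names and the `max_probes` parameter are assumptions made for this illustration. The Las Vegas version is always correct but has a random running time; the Monte Carlo version has a fixed probe budget but may fail to find a target that is present.

```python
import random

def las_vegas_find(items, target, rng):
    """Always returns a correct index; the number of probes is random.

    Assumes target occurs at least once, otherwise the loop never ends.
    """
    while True:
        i = rng.randrange(len(items))
        if items[i] == target:
            return i

def monte_carlo_find(items, target, rng, max_probes=5):
    """Uses at most max_probes probes; may return None although target exists."""
    for _ in range(max_probes):
        i = rng.randrange(len(items))
        if items[i] == target:
            return i
    return None  # possibly a wrong "not found" answer

# Half of the entries are "a", so each probe succeeds with probability 1/2.
items = ["a", "b"] * 50
rng = random.Random(0)
lv_index = las_vegas_find(items, "a", rng)
mc_index = monte_carlo_find(items, "a", rng)
```

With half of the entries equal to the target, the Monte Carlo version gives a wrong answer with probability (1/2)^5, and raising `max_probes` trades running time for accuracy.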

Because a randomized algorithm returns a random variable as its result, it is useful to have bounds on the tail probability of that random variable. Such a bound tells us that the probability that the random variable deviates from its expected value is small. One basic tool is Chebyshev’s Inequality.

Let X be a random variable with mean µ and standard deviation σ (variance σ²). Chebyshev’s inequality says that

$$\mathrm{P(|X-\mu|>k)\leq\frac{\sigma^2 }{k^2}}$$

for any given positive real number k. This inequality bounds the probability of a large deviation from the mean in terms of the variance of the random variable. In many cases, multiple random variables can be used to boost the confidence in the results. As long as these random variables are fully independent, Chernoff bounds can be used.
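Chebyshev's inequality can be checked numerically. The following sketch, with the hypothetical helper `chebyshev_check`, draws samples from a uniform distribution on [0, 1) (an arbitrary choice for illustration) and compares the observed fraction of samples farther than k from the mean with σ²/k².

```python
import random

def chebyshev_check(samples, k):
    """Return (empirical tail frequency, Chebyshev bound) for deviation k."""
    n = len(samples)
    mu = sum(samples) / n
    var = sum((x - mu) ** 2 for x in samples) / n  # population variance
    empirical = sum(abs(x - mu) > k for x in samples) / n
    bound = var / k ** 2
    return empirical, bound

rng = random.Random(1)
samples = [rng.random() for _ in range(10_000)]
results = {k: chebyshev_check(samples, k) for k in (0.3, 0.4, 0.45)}
for k, (empirical, bound) in results.items():
    print(f"k={k}: empirical={empirical:.3f}, bound={bound:.3f}")
```

The bound holds for every k, but it is usually loose because it uses only the mean and variance of the distribution.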

Let X1, X2, …, Xn be independent Poisson trials. In a Poisson trial, the probability of success varies from trial to trial. If X is the sum of X1 to Xn, with mean µ, then a weaker version of the Chernoff bound tells us that

$$\mathrm{P[X>(1+\delta)\mu]< e^{-\mu\delta^2/4}}$$

where δ ∈ (0, 1]. This shows that the probability decreases exponentially as we move away from the mean, which makes poor estimates much more unlikely.
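A small simulation can illustrate the upper-tail form of this bound, P[X > (1+δ)µ] < e^(−µδ²/4). The helper `chernoff_check`, the choice of success probabilities, and the number of runs are all assumptions made for this sketch. Each trial has its own success probability, as in a Poisson trial.

```python
import math
import random

def chernoff_check(probs, delta, runs, rng):
    """Return (empirical upper-tail frequency, Chernoff-style bound)."""
    mu = sum(probs)  # expected value of the sum of the trials
    threshold = (1 + delta) * mu
    exceed = 0
    for _ in range(runs):
        # One realization of X: count the successful trials.
        x = sum(rng.random() < p for p in probs)
        if x > threshold:
            exceed += 1
    return exceed / runs, math.exp(-mu * delta ** 2 / 4)

# 200 trials with success probabilities spread evenly from 0.1 to 0.9.
probs = [0.1 + 0.8 * i / 199 for i in range(200)]
empirical, bound = chernoff_check(probs, delta=0.2, runs=2000, rng=random.Random(7))
print(f"empirical={empirical:.4f}, bound={bound:.4f}")
```

Increasing the number of trials, and therefore µ, makes the bound shrink exponentially for a fixed δ.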

Data Stream Management System − In a Data Stream Management System (DSMS), there are multiple data streams. They arrive online and are continuous, temporally ordered, and potentially infinite. Once an element from a data stream has been processed, it is discarded or archived, and it cannot be easily retrieved unless it is explicitly stored in memory.

A stream data query processing architecture includes three parts: the end user, the query processor, and scratch space (which may consist of main memory and disks). An end user issues a query to the DSMS, and the query processor takes the query, processes it using the information stored in the scratch space, and returns the results to the user.

Queries can be either one-time queries or continuous queries. A one-time query is evaluated once over a point-in-time snapshot of the data set, with the answer returned to the user. A continuous query is evaluated continuously as data streams continue to arrive.
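The two query types can be sketched with a toy query processor. The class `QueryProcessor`, its method names, and the fixed-size window used as scratch space are hypothetical and greatly simplified; a real DSMS is far more elaborate. A one-time query runs once over a snapshot of the scratch space, while a continuous query produces a new answer each time an element arrives.

```python
from collections import deque

class QueryProcessor:
    """Toy query processor whose scratch space is a bounded window."""

    def __init__(self, window_size):
        # Scratch space: only the most recent window_size elements are kept.
        self.scratch = deque(maxlen=window_size)

    def ingest(self, item):
        # Once the window is full, the oldest element is discarded.
        self.scratch.append(item)

    def one_time_query(self, fn):
        # Evaluate fn once over a point-in-time snapshot.
        return fn(list(self.scratch))

    def continuous_query(self, stream, fn):
        # Re-evaluate fn every time a new element arrives.
        for item in stream:
            self.ingest(item)
            yield fn(list(self.scratch))

qp = QueryProcessor(window_size=3)
averages = list(qp.continuous_query(iter([10, 20, 30, 40]),
                                    lambda w: sum(w) / len(w)))
# averages == [10.0, 15.0, 20.0, 30.0]
latest_max = qp.one_time_query(max)
# latest_max == 40
```

Because the window holds only three elements, the value 10 is no longer available when the last average is computed, which mirrors the point that processed stream elements cannot be retrieved unless they are explicitly stored.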