Agile Data Science - Data Enrichment



Data enrichment refers to a range of processes used to enhance, refine and improve raw data. It transforms raw data into useful information. The goal of data enrichment is to make data a valuable asset for a modern business or enterprise.

The most common data enrichment process is the correction of spelling mistakes or typographical errors in a database using specific decision algorithms. Data enrichment tools also add useful information to simple data tables.
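As a minimal sketch of adding useful information to a simple data table, the example below joins raw records against a reference lookup. The records, the `COUNTRY_INFO` table and the `enrich` helper are all hypothetical names invented for illustration; they do not come from any particular enrichment tool.

```python
# Hypothetical raw records: a simple table with only an id and a country code.
raw_customers = [
    {"id": 1, "country_code": "IN"},
    {"id": 2, "country_code": "DE"},
    {"id": 3, "country_code": "XX"},  # code absent from the reference table
]

# Hypothetical reference data used to enrich each record.
COUNTRY_INFO = {
    "IN": {"country": "India", "currency": "INR"},
    "DE": {"country": "Germany", "currency": "EUR"},
}

def enrich(record, reference):
    """Return a copy of `record` with reference fields added when available."""
    extra = reference.get(record["country_code"], {})
    enriched = {**record, **extra}
    # Flag whether any reference data was found for this record.
    enriched["enriched"] = bool(extra)
    return enriched

enriched_customers = [enrich(r, COUNTRY_INFO) for r in raw_customers]
for row in enriched_customers:
    print(row)
```

Records whose code is missing from the reference table are kept unchanged and flagged, so no raw data is lost during enrichment.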

Consider the following code for spell correction of words −

import re
from collections import Counter

def words(text):
   "Split text into lowercase word tokens."
   return re.findall(r'\w+', text.lower())

# Word frequencies built from a corpus file; big.txt must be in the working directory.
with open('big.txt') as f:
   WORDS = Counter(words(f.read()))

def P(word, N=sum(WORDS.values())):
   "Probability of `word`, estimated from its frequency in the corpus."
   return WORDS[word] / N

def correction(word):
   "Most probable spelling correction for `word`."
   return max(candidates(word), key=P)

def candidates(word):
   "Generate possible spelling corrections for `word`."
   return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])

def known(words):
   "The subset of `words` that appear in the dictionary of WORDS."
   return set(w for w in words if w in WORDS)

def edits1(word):
   "All edits that are one edit away from `word`."
   letters = 'abcdefghijklmnopqrstuvwxyz'
   splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
   deletes = [L + R[1:] for L, R in splits if R]
   transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
   replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
   inserts = [L + c + R for L, R in splits for c in letters]
   return set(deletes + transposes + replaces + inserts)

def edits2(word):
   "All edits that are two edits away from `word`."
   return (e2 for e1 in edits1(word) for e2 in edits1(e1))

# Run the corrector at module level; these calls were unreachable inside edits2.
print(correction('speling'))
print(correction('korrectud'))

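This program builds a word-frequency dictionary from “big.txt”, a text file of correctly spelled words. For each input word, it generates candidate words that are one or two edits away, keeps the candidates found in the dictionary, and prints the most frequent one. If no candidate is known, the input word is printed unchanged.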
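The same idea can be tried without a corpus file. The sketch below is a condensed variant that assumes a tiny inline corpus (`SAMPLE_TEXT`, invented here for illustration) in place of big.txt, so its frequencies are not representative of real English.

```python
import re
from collections import Counter

# Hypothetical stand-in for big.txt: a tiny corpus of correctly spelled words.
SAMPLE_TEXT = """
the spelling of a word matters. spelling errors are corrected by the program.
a corrected word is returned when the spelling is wrong.
"""

WORDS = Counter(re.findall(r'\w+', SAMPLE_TEXT.lower()))
TOTAL = sum(WORDS.values())

def probability(word):
    """Relative frequency of `word` in the sample corpus."""
    return WORDS[word] / TOTAL

def edits1(word):
    """All strings one delete, transpose, replace or insert away from `word`."""
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def known(candidates):
    """The subset of `candidates` that appear in the corpus."""
    return {w for w in candidates if w in WORDS}

def correction(word):
    """Most frequent known word within two edits, else `word` itself."""
    two_edits = (e2 for e1 in edits1(word) for e2 in edits1(e1))
    candidates = (known([word]) or known(edits1(word))
                  or known(two_edits) or [word])
    return max(candidates, key=probability)

for typo in ['speling', 'korrectud', 'teh', 'zzzzz']:
    print(typo, '->', correction(typo))
```

With this sample corpus, 'speling' is one edit from 'spelling', 'korrectud' is two edits from 'corrected', 'teh' is a transposition of 'the', and 'zzzzz' has no known candidate, so it is returned unchanged.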

Output

The above code will generate the following output −

[Image: output of the spelling correction program]
