# Big Data Analytics - Cleansing Data

Once the data is collected, we normally have diverse data sources with different characteristics. The most immediate step would be to make these data sources homogeneous before continuing to develop our data product. Whether that is sensible, however, depends on the type of data: we should ask ourselves whether it is practical to homogenize it.

Maybe the data sources are completely different, and the information loss would be large if the sources were homogenized. In this case, we can think of alternatives. Can one data source help us build a regression model and the other a classification model? Is it possible to work with the heterogeneity to our advantage rather than just lose information? Making these decisions is what makes analytics interesting and challenging.

In the case of reviews, each data source may be written in a different language. Again, we have two choices −

• Homogenization − It involves translating the different languages into the language for which we have the most data. The quality of translation services is acceptable, but if we would like to translate massive amounts of data with an API, the cost would be significant. There are software tools available for this task, but they would be costly too.

• Heterogenization − Would it be possible to develop a solution for each language? As it is simple to detect the language of a corpus, we could develop a recommender for each language. This would involve more work, since each recommender has to be tuned separately and the effort grows with the number of languages, but it is definitely a viable option if we have only a few languages.
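The per-language option can be illustrated with a minimal base R sketch. It is not a production language detector: the reviews, the tiny stopword lists and the `detect_language` helper are all hypothetical and exist only to show how texts could be routed to one group per language.

```r
# Hypothetical reviews in two languages (made-up data for illustration).
reviews <- c(
  "the food was great and the service was fast",
  "la comida estaba muy buena y el servicio fue rapido",
  "this is the best burger in the city",
  "el lugar es pequeno pero la comida es excelente"
)

# Tiny hand-picked stopword lists; a real detector would use far more evidence.
stopwords <- list(
  en = c("the", "and", "is", "was", "this", "in"),
  es = c("el", "la", "y", "es", "pero", "fue", "muy")
)

# Guess the language whose stopwords occur most often in the text.
detect_language <- function(text) {
  tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
  hits <- sapply(stopwords, function(sw) sum(tokens %in% sw))
  names(hits)[which.max(hits)]
}

languages <- vapply(reviews, detect_language, character(1), USE.NAMES = FALSE)

# One group of reviews per language; each group would feed its own recommender.
reviews_by_language <- split(reviews, languages)
```

Each element of `reviews_by_language` could then be cleaned and modelled separately, which is where the extra tuning work per language comes from.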

In the present case, we first need to clean the unstructured data and then convert it to a data matrix in order to apply topic modelling to it. In general, when getting data from Twitter, there are several characters we are not interested in using, at least in the first stage of the data cleansing process.
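The two steps named above, cleaning and then conversion to a data matrix, can be sketched in base R as follows. This is only an illustration under simple assumptions: the `docs` vector is made-up sample text, `tokenize` is a hypothetical helper, and the cleaning is reduced to lower-casing and keeping letters.

```r
# Hypothetical raw texts standing in for collected tweets.
docs <- c(
  "Big Mac with fries!",
  "No special sauce on a big mac",
  "Fries, fries and more fries"
)

# Step 1: minimal cleaning - lower-case the text and keep runs of letters only.
tokenize <- function(text) {
  tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
  tokens[nchar(tokens) > 0]
}
tokens_per_doc <- lapply(docs, tokenize)

# Step 2: build the data matrix - one row per document, one column per term,
# each cell holding how often the term occurs in the document.
vocabulary <- sort(unique(unlist(tokens_per_doc)))
dtm <- t(vapply(tokens_per_doc, function(tokens) {
  tabulate(match(tokens, vocabulary), nbins = length(vocabulary))
}, integer(length(vocabulary))))
dimnames(dtm) <- list(paste0("doc", seq_along(docs)), vocabulary)
```

A document-term matrix of this shape, with documents as rows and term counts as columns, is the kind of input that topic models typically expect.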

For example, after getting the tweets we get these strange characters: "<ed><U+00A0><U+00BD><ed><U+00B8><U+008B>". These are probably emoticons, so in order to clean the data, we will just remove them using the following script. This code is also available in the bda/part1/collect_data/cleaning_data.R file.

rm(list = ls(all = TRUE)); gc() # Clears the global environment

# NOTE: rm() above also removes `df`, so the collected tweets must be
# loaded into `df` again before running the lines below.

# Some tweets
head(df$text)
# [1] "I’m not a big fan of turkey but baked Mac & cheese <ed><U+00A0><U+00BD><ed><U+00B8><U+008B>"
# [2] "@Jayoh30 Like no special sauce on a big mac. HOW"

### We are interested in the text - Let’s clean it!

# We first convert the encoding of the text from latin1 to ASCII
df$text <- sapply(df$text, function(row) iconv(row, "latin1", "ASCII", sub = ""))

# Create a function to clean tweets
clean.text <- function(tx) {
  # Remove links: "htt" plus up to 20 following characters
  tx <- gsub("htt.{1,20}", " ", tx, ignore.case = TRUE)
  # Remove punctuation except "#", and the retweet marker "RT"
  # (case-insensitive, so "rt" inside words is removed as well)
  tx <- gsub("[^#[:^punct:]]|@|RT", " ", tx, perl = TRUE, ignore.case = TRUE)
  # Remove digits
  tx <- gsub("[[:digit:]]", " ", tx)
  # Collapse runs of spaces into a single space
  tx <- gsub(" {1,}", " ", tx)
  # Trim leading and trailing whitespace
  tx <- gsub("^\\s+|\\s+$", "", tx)
  return(tx)
}

clean_tweets <- lapply(df$text, clean.text)

# Cleaned tweets