Big Data Analytics - Data Collection


Data collection plays the most important role in the Big Data cycle. The Internet provides almost unlimited sources of data for a variety of topics. The importance of this area depends on the type of business, but traditional industries can acquire a diverse source of external data and combine those with their transactional data.

For example, let’s assume we would like to build a system that recommends restaurants. The first step would be to gather data, in this case, reviews of restaurants from different websites and store them in a database. As we are interested in raw text, and would use that for analytics, it is not that relevant where the data for developing the model would be stored. This may sound contradictory with the big data main technologies, but in order to implement a big data application, we simply need to make it work in real time.

Twitter Mini Project

Once the problem is defined, the following stage is to collect the data. The following miniproject idea is to work on collecting data from the web and structuring it to be used in a machine learning model. We will collect some tweets from the twitter rest API using the R programming language.

First of all create a twitter account, and then follow the instructions in the twitteR package vignette to create a twitter developer account. This is a summary of those instructions −

  • Go to and log in.

  • After filling in the basic info, go to the "Settings" tab and select "Read, Write and Access direct messages".

  • Make sure to click on the save button after doing this

  • In the "Details" tab, take note of your consumer key and consumer secret

  • In your R session, you’ll be using the API key and API secret values

  • Finally run the following script. This will install the twitteR package from its repository on github.

install.packages(c("devtools", "rjson", "bit64", "httr"))  

# Make sure to restart your R session at this point 

We are interested in getting data where the string "big mac" is included and finding out which topics stand out about this. In order to do this, the first step is collecting the data from twitter. Below is our R script to collect required data from twitter. This code is also available in bda/part1/collect_data/collect_data_twitter.R file.

rm(list = ls(all = TRUE)); gc() # Clears the global environment
Sys.setlocale(category = "LC_ALL", locale = "C")

### Replace the xxx’s with the values you got from the previous instructions

# consumer_key = "xxxxxxxxxxxxxxxxxxxx"
# consumer_secret = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

# access_token = "xxxxxxxxxx-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

# access_token_secret= "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

# Connect to twitter rest API
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_token_secret)

# Get tweets related to big mac
tweets <- searchTwitter(’big mac’, n = 200, lang = ’en’)
df <- twListToDF(tweets)

# Take a look at the data

# Check which device is most used
sources <- sapply(tweets, function(x) x$getStatusSource())
sources <- gsub("</a>", "", sources)
sources <- strsplit(sources, ">")
sources <- sapply(sources, function(x) ifelse(length(x) > 1, x[2], x[1]))
source_table = table(sources)
source_table = source_table[source_table > 1]
freq = source_table[order(source_table, decreasing = T)]

#                       Frequency
# Twitter for iPhone       71
# Twitter for Android      29
# Twitter Web Client       25
# recognia                 20