Logistic Regression in Python - Restructuring Data
Whenever an organization conducts a survey, it tries to collect as much information as possible from the customer, on the assumption that the information will prove useful in some way at a later point. To solve the current problem, we need to pick out only the information that is directly relevant to it.
Displaying All Fields
Now, let us see how to select the data fields useful to us. Run the following statement in the code editor.
print(list(df.columns))
You will see the following output −
['age', 'job', 'marital', 'education', 'default', 'balance', 'housing', 'loan', 'contact', 'day', 'month', 'duration', 'campaign', 'pdays', 'previous', 'poutcome', 'y']
The output shows the names of all the columns in the dataset. The last column, y, is a binary field indicating whether the customer has a term deposit with the bank. The values of this field are either yes or no. You can read the description and purpose of each column in the banks-name.txt file that was downloaded as part of the data.
Eliminating Unwanted Fields
Examining the column names, you will notice that some of the fields have no significance for the problem at hand. For example, fields such as month, day, and campaign are of no use to us. We will eliminate these fields from our dataset. To drop a column, we use the drop method as shown below −
# Drop columns which are not needed.
df.drop(df.columns[[0, 3, 5, 8, 9, 10, 11, 12, 13, 14]], axis = 1, inplace = True)
The command drops the columns at positions 0, 3, 5, 8, and so on. To make sure that each index points at the column you intend, use the following statement −
df.columns[1]
'job'
This prints the column name for the given index.
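Checking one index at a time becomes tedious for a list of ten positions. A minimal sketch of checking them all at once follows. It assumes an empty stand-in DataFrame built from the column names printed earlier (the real df would be loaded from the data file), and the names drop_positions and to_drop are introduced here only for illustration.

```python
import pandas as pd

# Column names copied from the printed output earlier in this chapter.
# The empty DataFrame is only a stand-in for the real df.
columns = ['age', 'job', 'marital', 'education', 'default', 'balance',
           'housing', 'loan', 'contact', 'day', 'month', 'duration',
           'campaign', 'pdays', 'previous', 'poutcome', 'y']
df = pd.DataFrame(columns=columns)

# Print every positional index next to its column name.
for position, name in enumerate(df.columns):
    print(position, name)

# Resolve the index list used in the drop statement to column names.
drop_positions = [0, 3, 5, 8, 9, 10, 11, 12, 13, 14]
to_drop = list(df.columns[drop_positions])
print(to_drop)
```

Run this before the drop statement, since the positions shift once columns have been removed.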
After dropping the columns which are not required, examine the data with the head statement. The screen output is shown here −
df.head()
           job  marital default housing loan poutcome   y
0   unemployed  married      no      no   no  unknown  no
1     services  married      no     yes  yes  failure  no
2   management   single      no     yes   no  failure  no
3   management  married      no     yes  yes  unknown  no
4  blue-collar  married      no     yes   no  unknown  no
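The positional drop shown earlier can also be written with column names, which some readers may find easier to review. The sketch below is an alternative, not the statement this tutorial uses. It assumes an empty stand-in DataFrame with the same column names, and it returns a new DataFrame (here called reduced, a name chosen for illustration) instead of modifying df in place.

```python
import pandas as pd

# Stand-in for the real df: same column names, no rows.
columns = ['age', 'job', 'marital', 'education', 'default', 'balance',
           'housing', 'loan', 'contact', 'day', 'month', 'duration',
           'campaign', 'pdays', 'previous', 'poutcome', 'y']
df = pd.DataFrame(columns=columns)

# The same ten columns as positions 0, 3, 5, 8, ..., 14, listed by name.
unwanted = ['age', 'education', 'balance', 'contact', 'day', 'month',
            'duration', 'campaign', 'pdays', 'previous']

# drop(columns=...) returns a new DataFrame and leaves df unchanged.
reduced = df.drop(columns=unwanted)
print(list(reduced.columns))
```

The remaining columns match the ones shown in the head output: job, marital, default, housing, loan, poutcome, and y.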
Now we have only the fields that we consider important for our data analysis and prediction. This is the step where the data scientist's judgment matters: the data scientist has to select the appropriate columns for model building.
For example, the type of job may not, at first glance, seem worth including in the dataset, but it is a very useful field. Not all types of customers will open a term deposit (TD). Lower-income customers may not open TDs, while higher-income customers will usually park their excess money in TDs. So the type of job becomes significantly relevant in this scenario. Likewise, carefully select the columns that you feel will be relevant for your analysis.
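One way to check such a hunch before keeping or dropping a column is to tabulate the share of yes and no outcomes per category. The sketch below shows the idea with pandas crosstab. The six rows are invented purely to make the example runnable and are not drawn from the bank data, so the printed proportions say nothing about real customers. The names toy and rates are illustrative.

```python
import pandas as pd

# Invented rows for illustration only; replace with the real df.
toy = pd.DataFrame({
    'job': ['management', 'management', 'services',
            'services', 'unemployed', 'unemployed'],
    'y':   ['yes', 'no', 'no', 'no', 'no', 'yes'],
})

# Share of each outcome within every job category (each row sums to 1).
rates = pd.crosstab(toy['job'], toy['y'], normalize='index')
print(rates)
```

On the real data, a column whose categories show clearly different proportions is a candidate to keep.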
In the next chapter, we will prepare our data for building the model.