How to Convert to Best Data Types Automatically in Pandas?


Pandas is a popular Python library for cleaning and transforming data. It provides several ways to convert data types, such as the astype() method. However, converting every column by hand is time-consuming and error-prone.

To address this, Pandas 1.0 introduced the convert_dtypes() method, which automatically converts each column to the best-suited data type that supports pd.NA, based on the values in the column. This reduces the need for manual type conversion and helps keep the data appropriately typed.
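As a minimal sketch of how convert_dtypes() behaves, consider the snippet below. The column names and values are made up for illustration, and the exact name of the resulting string dtype may vary between Pandas versions.

```python
import pandas as pd

# Hypothetical sample data: each column contains a missing value
raw = pd.DataFrame({
    'a': [1, 2, None],          # stored as float64 because of the missing value
    'b': ['x', 'y', None],      # text column
    'c': [True, False, None],   # stored as object because of the missing value
})

# Let Pandas choose nullable dtypes for every column
converted = raw.convert_dtypes()

print("Before:\n", raw.dtypes)
print("\nAfter:\n", converted.dtypes)
# 'a' becomes Int64 and 'c' becomes boolean; 'b' becomes a string dtype
```

Because the nullable dtypes use pd.NA for missing values, the integer column does not have to fall back to float just to hold a missing entry.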

Converting the Datatype of a Pandas Series

The example below converts a Pandas Series of numeric strings to a numeric data type using pd.to_numeric().

Example

import pandas as pd

# Create a Series of integer-like and float-like values stored as strings
data = pd.Series(['1', '2', '3.1', '4.0', '5'])

# Print the data types of the Series
print("Original data types:")
print(data.dtypes)

# Convert the Series to a numeric data type; unparseable values become NaN
data = pd.to_numeric(data, errors='coerce')

# Print the data types of the Series after conversion
print("\nNew data types:")
print(data.dtypes)

# Print the updated Series
print("\nUpdated Series:")
print(data)

Explanation

  • Import the Pandas library using the import statement.

  • Create a Pandas Series named data whose values are all strings, some representing integers and some representing floats.

  • Print the original data types of the Series using the dtypes attribute.

  • Use the pd.to_numeric() function to convert the Series to a suitable numeric data type (float64 here, because some values have a fractional part).

  • Pass the errors parameter with the value 'coerce' to force any invalid values to be converted to NaN.

  • Print the new data types of the Series using the dtypes attribute.

  • Print the updated Series.

To run the above code, we need to run the command shown below.

Command

python3 main.py

Output

Original data types:
object

New data types:
float64

Updated Series:
0    1.0
1    2.0
2    3.1
3    4.0
4    5.0
dtype: float64
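The Series above contains only valid numbers, so errors='coerce' has no visible effect there. As a small sketch with made-up values, the snippet below shows what happens when one entry cannot be parsed, and how the resulting data type depends on the values.

```python
import pandas as pd

# Hypothetical input: 'abc' cannot be parsed as a number
raw = pd.Series(['1', '2', 'abc', '4.5'])

# With errors='coerce', the invalid entry becomes NaN instead of raising
coerced = pd.to_numeric(raw, errors='coerce')
print(coerced)
print("Missing values:", coerced.isna().sum())

# When every value is a whole number, the result is an integer dtype
whole = pd.to_numeric(pd.Series(['1', '2', '3']))
print(whole.dtype)
```

Without errors='coerce', pd.to_numeric() raises a ValueError on the invalid entry, which can be preferable when bad values should not pass silently.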

Converting the Datatype of a Pandas DataFrame

The example below converts individual columns of a DataFrame with the astype() method.

Example

import pandas as pd

# create a sample dataframe with mixed data types
data = {'name': ['John', 'Marry', 'Peter', 'Jane', 'Paul'],
        'age': [25, 30, 40, 35, 27],
        'gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
        'income': ['$500', '$1000', '$1200', '$800', '$600']}
df = pd.DataFrame(data)

# print the original data types of the dataframe
print("Original data types:\n", df.dtypes)

# convert 'age' column to float
df['age'] = df['age'].astype(float)

# convert 'income' column to integer by removing the dollar sign
# (regex=False treats '$' as a literal character, not a regex end-of-string anchor)
df['income'] = df['income'].str.replace('$', '', regex=False).astype(int)

# print the new data types of the dataframe
print("\nNew data types:\n", df.dtypes)
print("\nDataFrame after conversion:\n", df)

Explanation

  • First, we import the Pandas library.

  • We create a sample DataFrame whose columns hold strings (object dtype) and integers (int64).

  • We print the original data types of the DataFrame using the dtypes attribute.

  • We convert the 'age' column to float using the astype() method, which converts the column data type to the specified type.

  • We strip the dollar sign from the 'income' column using the str.replace() method and then convert the resulting strings to integers using the astype() method.

  • We print the new data types of the DataFrame using the dtypes attribute to confirm the data type conversion.

  • Finally, we print the entire DataFrame to see the converted values.

Note: The astype() method of a Series converts that single Series to the specified data type, while the astype() method of a DataFrame can convert several columns at once when it is given a dictionary that maps column names to data types.

Output

Original data types:
 name      object
age        int64
gender    object
income    object
dtype: object

New data types:
 name       object
age       float64
gender     object
income      int64
dtype: object

DataFrame after conversion:
     name   age  gender  income
0   John  25.0    Male     500
1  Marry  30.0  Female    1000
2  Peter  40.0    Male    1200
3   Jane  35.0  Female     800
4   Paul  27.0    Male     600
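The note above mentions that the astype() method of a DataFrame can handle multiple columns. A minimal sketch of that form is shown below; the column names and values are illustrative only.

```python
import pandas as pd

# Hypothetical DataFrame: 'age' holds integers, 'score' holds numeric strings
people = pd.DataFrame({'name': ['John', 'Jane'],
                       'age': [25, 35],
                       'score': ['10', '20']})

# Convert two columns in one call by mapping column names to target dtypes
typed = people.astype({'age': 'float64', 'score': 'int64'})

print(typed.dtypes)
```

Columns that are not listed in the dictionary, such as 'name' here, keep their original data type.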

Conclusion

In conclusion, converting data types is an essential task in data analysis and manipulation. Pandas provides several ways to do it: specifying the data type while loading the data, using astype() to convert a Series or DataFrame to an explicit type, using pd.to_numeric() to parse numeric strings, and using convert_dtypes() or infer_objects() to let Pandas pick a suitable data type for each column.

It is essential to choose the appropriate data type for each column to optimise memory usage and improve data analysis performance.
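Two of the approaches mentioned in the conclusion are not shown in the examples above: setting data types at load time and infer_objects(). The sketch below illustrates both under simple assumptions; the CSV text and column names are invented for the example.

```python
import io
import pandas as pd

# Hypothetical CSV content, read from memory instead of a file
csv_text = "id,price\n1,9.5\n2,12.0\n"

# Specify the data type of each column while loading
loaded = pd.read_csv(io.StringIO(csv_text),
                     dtype={'id': 'int32', 'price': 'float32'})
print(loaded.dtypes)

# A column of integers that was stored with the generic object dtype
generic = pd.DataFrame({'n': [1, 2, 3]}, dtype='object')

# infer_objects() attempts to find a more specific dtype for object columns
inferred = generic.infer_objects()
print(inferred.dtypes)
```

Choosing smaller types such as int32 or float32 at load time is one way to reduce memory usage, provided the values fit within the range and precision of those types.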

Updated on: 03-Aug-2023
