What is data processing in data science?


Data processing, data cleaning, and data wrangling in data science all refer to the same thing: the process of converting or mapping data from its initial 'raw' form into another format in order to prepare it for further analysis.


The main objectives of data processing in data science are:

  • Identifying and handling missing values
  • Data formatting
  • Data normalization
  • Data binning
  • Turning categorical values into numerical values

Let us look at each of these briefly:

1. Missing Values

In data science projects, the data given to us sometimes has missing values, marked as '?', 'NaN', '0', or just blanks. What should we do in such cases? The first choice is to drop the variable or drop the data entry.

The second choice is to replace such values with the average (mean) of similar data points.
If the missing data is categorical, replace it with the most frequent value, that is, the mode. We can also replace it based on other functions. The last choice is to leave the data as missing and come back to it later if a relationship emerges.

How to drop missing values in Python?

Say df is the dataframe; we can drop the missing values as follows:

df.dropna(). Inside the parentheses we pass arguments as follows: axis=0 drops the entire row containing the missing value, while axis=1 drops the entire column.
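
For instance, a minimal sketch (the dataframe and its 'price' and 'horsepower' columns are made up purely for illustration):

import pandas as pd
import numpy as np

# Small illustrative dataframe with missing values
df = pd.DataFrame({'price': [13495, np.nan, 16500],
                   'horsepower': [111, 154, np.nan]})

df_rows_dropped = df.dropna(axis=0)   # drops every row containing a missing value
df_cols_dropped = df.dropna(axis=1)   # drops every column containing a missing value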

How to replace the missing values?

df.replace(missing_value, new_value)
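
As a rough sketch of replacing missing values in a numerical column with its mean (again, the 'price' column is only illustrative):

import pandas as pd
import numpy as np

df = pd.DataFrame({'price': [13495, np.nan, 16500]})

# Replace the missing entries with the mean of the column
mean_price = df['price'].mean()
df['price'] = df['price'].replace(np.nan, mean_price)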

2. Data Formatting

Data is collected from different places and arrives in different formats. Bringing it into a common standard expression allows users to make meaningful comparisons. For instance, weight in India is measured in kilograms, but suppose the data we received is in pounds. So we convert it into kilograms.

We know that price is a numerical value, but when we check its data type it shows as object. We should convert the type of price to integer or float, as the requirements demand.

To identify the data types: df.dtypes
To convert a data type: df.astype()
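
A small sketch, assuming a 'price' column that was read in as strings (object):

import pandas as pd

df = pd.DataFrame({'price': ['13495', '16500', '13950']})

print(df.dtypes)                         # 'price' shows up as object
df['price'] = df['price'].astype(float)  # convert it to a numeric type
print(df.dtypes)                         # 'price' is now float64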

3. Data Normalization

Here we bring feature values that lie in different ranges onto a uniform scale so that we can obtain a correct relationship between them. For instance, suppose length ranges from 500 meters to 700 meters while width ranges from 20 meters to 50 meters. As you can see, one value is much larger than the other, so obtaining a correct relationship directly is not possible.

For this we have to normalize the data to bring both features onto the same scale, so we can identify whether there is any relationship between them.

We can normalize the features as follows:

1) Xnew = Xold/Xmax                     # Simple feature scaling

2) Xnew = (Xold - Xmin)/(Xmax - Xmin)   # Min-max scaling

3) Xnew = (Xold - Mu)/Sigma             # Z-score standardization
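
A rough sketch of the three methods in pandas, using the length and width example from above (the values themselves are made up):

import pandas as pd

df = pd.DataFrame({'length': [500.0, 600.0, 700.0],
                   'width': [20.0, 35.0, 50.0]})

# 1) Simple feature scaling
df['length_scaled'] = df['length'] / df['length'].max()

# 2) Min-max scaling
df['length_minmax'] = (df['length'] - df['length'].min()) / (df['length'].max() - df['length'].min())

# 3) Z-score standardization
df['length_zscore'] = (df['length'] - df['length'].mean()) / df['length'].std()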

4. Data Binning

Here we group values into bins, converting numerical values into categorical values. Example: say we have an 'age' column in our dataframe; we convert its numerical values into categories using binning.

Ages between 0 and 17 are categorized as children, 18 to 59 as adults, and ages 60 and above as old. This is called binning.
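
A minimal sketch using the pandas cut function, with the age boundaries from the example above (the ages in the dataframe are made up):

import pandas as pd

df = pd.DataFrame({'age': [5, 16, 25, 43, 61, 78]})

bins = [0, 17, 59, 120]                 # upper limit of each age group
labels = ['children', 'adults', 'old']
df['age_group'] = pd.cut(df['age'], bins=bins, labels=labels, include_lowest=True)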

5. Turning categorical values into numerical value

To develop machine learning models we cannot use categorical values directly, so we have to convert them into numerical values. We can use the pandas get_dummies function to do this. Example: say we have a problem where gender is either male or female.
We can convert male and female into the numerical values '0' and '1' and then use them in our model.
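
A small sketch with get_dummies, following the gender example above (the sample values are only illustrative):

import pandas as pd

df = pd.DataFrame({'gender': ['male', 'female', 'female', 'male']})

# Each category becomes its own indicator column
dummies = pd.get_dummies(df['gender'])
df = pd.concat([df, dummies], axis=1)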

With that, our data processing, data cleaning, and data wrangling in data science is complete, and we can move on to the next step: data exploration.
