What is Exploratory Data Analysis in Python?


Exploratory Data Analysis (EDA) in Python helps us in the following ways:

  1. Summarize the main characteristics of the data.
  2. Gain a better understanding of the data set.
  3. Uncover relationships between variables.
  4. Extract the important variables.

We can do the exploratory analysis in the following ways:

1. Descriptive Analysis

i) describe()

describe() summarizes the basic features of the data. For each numerical column it reports the count, mean, standard deviation, minimum, quartiles and maximum.

Here df is our DataFrame.

df.describe() - describes the numerical columns only.
df.describe(include='all') - describes every column, including object type columns, which are not numerical.
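
As a minimal sketch of the two calls above, the snippet below builds a tiny DataFrame and summarizes it. The column names and values are made up for illustration and do not come from a real data set.

```python
import pandas as pd

# Made-up sample data, for illustration only
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Age": [23, 31, 27, 45],
})

# Numerical columns only: count, mean, std, min, quartiles, max
numeric_summary = df.describe()

# All columns: object columns additionally get unique, top and freq
full_summary = df.describe(include="all")

print(numeric_summary)
print(full_summary)
```

For object columns the numerical rows (mean, std and so on) are filled with NaN, and for numerical columns the unique, top and freq rows are NaN.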

ii) value_counts()

If we have categorical data and want the count of each category, we can use value_counts().

d = df["Gender"].value_counts()  - returns each category (here Male and Female) with its count, most frequent first. For example:
Female    42
Male      40
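
A small sketch of value_counts() on made-up data is shown below. The normalize=True option, which returns proportions instead of counts, is an extra not covered above.

```python
import pandas as pd

# Made-up sample data, for illustration only
df = pd.DataFrame({"Gender": ["Male", "Female", "Female", "Male", "Female"]})

# Absolute counts, sorted from most to least frequent
counts = df["Gender"].value_counts()

# Proportions of each category (they sum to 1)
shares = df["Gender"].value_counts(normalize=True)

print(counts)
print(shares)
```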

iii) Box plots

syntax: sns.boxplot(x='country', y='population', data=df)

This draws one box per country, which helps us explore the relationship between country and population.

The box plot gives us the median, 25th percentile, 75th percentile, upper extreme, lower extreme and the outliers.

From this we can easily understand the distribution of the data, whether it is skewed or not and identify the outliers.
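
The numbers behind a box plot can also be computed directly. The sketch below assumes the common 1.5 x IQR rule for the whiskers (the default in seaborn and matplotlib) and uses made-up values with one deliberately extreme point.

```python
import pandas as pd

# Made-up values; 300 is deliberately extreme
population = pd.Series([10, 12, 13, 15, 16, 18, 20, 300])

q1 = population.quantile(0.25)      # 25th percentile (bottom of the box)
median = population.quantile(0.50)  # line inside the box
q3 = population.quantile(0.75)      # 75th percentile (top of the box)
iqr = q3 - q1                       # interquartile range (height of the box)

# Points beyond these fences are drawn as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

inside = population[(population >= lower_fence) & (population <= upper_fence)]
lower_extreme = inside.min()        # end of the lower whisker
upper_extreme = inside.max()        # end of the upper whisker
outliers = population[(population < lower_fence) | (population > upper_fence)]

print(q1, median, q3, lower_extreme, upper_extreme, outliers.tolist())
```

Here the median sits well below the single outlier, which is the kind of right skew a box plot makes visible at a glance.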

iv) Scatter plots 

When two features are continuous, we can use scatter plots to check the relationship between them, whether the relationship is positive or negative. 

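import matplotlib.pyplot as plt

x = df['mileage']  # illustrative column name; any continuous feature works
y = df['price']
plt.scatter(x, y)
plt.show()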

2. groupby()

Groupby is applied to categorical variables. The data is grouped based on one or several variables and analysis is performed on the individual groups.
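
A minimal sketch of grouping by one variable and by several variables follows. The column names (country, gender, income) and the values are made up for illustration.

```python
import pandas as pd

# Made-up sample data, for illustration only
df = pd.DataFrame({
    "country": ["A", "A", "B", "B", "B"],
    "gender": ["Male", "Female", "Male", "Female", "Female"],
    "income": [10, 20, 30, 40, 50],
})

# Group by a single categorical variable and average income per group
mean_by_country = df.groupby("country")["income"].mean()

# Group by several variables; the result is indexed by (country, gender)
mean_by_both = df.groupby(["country", "gender"])["income"].mean()

print(mean_by_country)
print(mean_by_both)
```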

3. Correlation and Causation

Correlation is a measure of the extent of interdependence between two variables. 

Causation is the relationship between cause and effect of two variables. 

To get the correlations in a DataFrame df we simply write df.corr()

We get values between -1 and 1. By default these are Pearson correlation coefficients.

The Pearson Correlation measures the linear dependence between two variables x and y.

1 : indicates total positive linear correlation.

0 : indicates no linear correlation. The two variables have no linear relationship, although they may still be related in a non-linear way.

-1 : indicates total negative linear correlation.
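
The three cases can be reproduced with a small made-up DataFrame. In this sketch the columns up, down and curve are constructed from x so that the coefficients come out as 1, -1 and 0.

```python
import pandas as pd

# Made-up data, constructed so the coefficients are easy to predict
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "up": [2, 4, 6, 8, 10],    # exactly 2 * x
    "down": [10, 8, 6, 4, 2],  # exactly 12 - 2 * x
    "curve": [4, 1, 0, 1, 4],  # exactly (x - 3) ** 2
})

# Pairwise Pearson coefficients between all numerical columns
corr = df.corr()

print(corr["x"])
```

The curve column is fully determined by x, yet its Pearson coefficient with x is 0, because the coefficient only measures the linear part of a relationship.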

Then we arrive at the p-value. The p-value is the probability of observing a correlation at least as strong as the one in our data if the two variables were in fact uncorrelated. Normally we choose a significance level of 0.05, which means that we accept a 5% chance of calling a correlation significant when there is none.

p-value is < 0.001 : we say that there is strong evidence that the correlation is significant.

p-value is < 0.05 : we say that there is moderate evidence that the correlation is significant.

p-value is < 0.1 : we say that there is weak evidence that the correlation is significant.

p-value is > 0.1 : we say that there is no evidence that the correlation is significant.
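
df.corr() returns only the coefficients. One way to obtain the p-value as well is scipy.stats.pearsonr, sketched below on made-up data. Using SciPy here is an assumption, since the text above does not name a library for this step.

```python
from scipy import stats

# Made-up data: y is roughly 2 * x with a little noise
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

# pearsonr returns the Pearson coefficient and its two-sided p-value
r, p_value = stats.pearsonr(x, y)

print("Pearson coefficient:", r)
print("p-value:", p_value)
```

With a coefficient this close to 1, the p-value falls below 0.001, which corresponds to the "strong evidence" row of the scale above.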

4. ANOVA: Analysis of Variance

The Analysis of Variance is a statistical method used to test whether there is a significant difference between the means of two or more groups.

F-test score - ANOVA assumes that the means of all the groups are the same. It calculates how much the actual means deviate from this assumption, relative to the variation within the groups, and reports it as the F-test score. A larger F-score indicates a larger difference between the means.

P-value - tells us how statistically significant our calculated F-score is.
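
A one-way ANOVA can be sketched with scipy.stats.f_oneway. The choice of SciPy and the three made-up groups below are assumptions for illustration. Groups a and c have the same mean, while group b is clearly higher.

```python
from scipy import stats

# Made-up measurements for three groups
group_a = [20, 21, 19, 22, 20]
group_b = [30, 29, 31, 28, 30]
group_c = [20, 22, 21, 19, 20]

# All three groups: one mean differs, so the F-score is large and the p-value small
f_score, p_value = stats.f_oneway(group_a, group_b, group_c)

# Only the two groups with equal means: F-score near 0, p-value near 1
f_same, p_same = stats.f_oneway(group_a, group_c)

print("all groups:", f_score, p_value)
print("a vs c:", f_same, p_same)
```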

Thank you for reading this introduction to exploratory data analysis in Python.
