This is the continuation of our blog series on getting started with data science using Python. In the last post we saw the fundamentals of using Pandas.
Here we will discuss descriptive analysis of a publicly available dataset, to build some intuition about what can be done when faced with a real dataset.
Descriptive analysis is used to describe the basic features of the data; it provides summaries of the data and of what inferences can be drawn.
Let's get started.
We will use the famous Titanic Data Set (http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.xls) to set the ground for the analysis.
First download it, keep it in your local directory and load it into a pandas data frame:
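A minimal sketch of the loading step, assuming the file was saved as `titanic3.xls` next to your script. The helper name `load_titanic` is ours, purely for illustration, and reading legacy `.xls` files with `pd.read_excel` needs an Excel engine such as `xlrd` installed.

```python
import pandas as pd


def load_titanic(path="titanic3.xls"):
    """Load the Titanic spreadsheet from a local path into a DataFrame."""
    # Assumption: the file sits at `path`; legacy .xls files need the xlrd engine.
    return pd.read_excel(path)


# Usage, once the file has been downloaded:
# df = load_titanic("titanic3.xls")
# print(df.head())
```

Calling `df.head()` right after loading is a quick way to confirm the columns look the way you expect.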
So we have successfully loaded the dataset.
The objective with this dataset is to predict which passengers survived, so first of all we need to check whether the dataset is balanced or imbalanced. Many of us make the mistake of not checking this upfront.
- survival: Survival (0 = no; 1 = yes)
This is the column to use for checking the classes:
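Here is a small sketch of the class check using `value_counts`. The tiny data frame is a made-up stand-in for the real file, and we assume the target column is named `survived`, so adjust the name if yours differs.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic data; the real file has far more rows.
df = pd.DataFrame({"survived": [0, 1, 0, 0, 1, 0]})

# Absolute number of rows in each class.
class_counts = df["survived"].value_counts()

# Share of each class, which makes an imbalance easier to spot.
class_share = df["survived"].value_counts(normalize=True)

print(class_counts)
print(class_share)
```

If one class takes a much larger share than the other, the dataset is imbalanced and plain accuracy can be misleading.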
So now we know the count of each class.
Checking the count of values in each column:
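One way to get the per-column counts is `DataFrame.count`, which counts only the non-missing values. The sample rows below are invented for illustration.

```python
import pandas as pd

# Hypothetical sample with a few missing values.
df = pd.DataFrame({
    "survived": [1, 0, 1, 0],
    "age": [29.0, None, 2.0, 30.0],
    "cabin": ["B5", None, None, "C22"],
})

# count() returns the number of non-null entries per column.
non_null_counts = df.count()
print(non_null_counts)
```

Columns whose count is lower than the number of rows contain missing values.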
Checking the data type of each column:
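The `dtypes` attribute lists the type pandas inferred for every column. A quick sketch on made-up rows:

```python
import pandas as pd

# Hypothetical sample mixing integer, float and text columns.
df = pd.DataFrame({
    "survived": [1, 0, 1],
    "age": [29.0, 2.0, 30.0],
    "sex": ["female", "male", "male"],
})

# One entry per column, holding the inferred data type.
column_types = df.dtypes
print(column_types)
```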
To check which columns hold categorical variables:
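A simple heuristic, and only a heuristic, is to treat every non-numeric column as a candidate categorical column. The sketch below uses `select_dtypes` on invented rows; note that numerically coded categories (a passenger class stored as 1, 2, 3, say) will not be picked up this way.

```python
import pandas as pd

# Hypothetical sample; only the text columns should be reported.
df = pd.DataFrame({
    "survived": [1, 0, 1],
    "name": ["Allen", "Allison", "Anderson"],
    "sex": ["female", "male", "male"],
    "age": [29.0, 2.0, 48.0],
})

# Keep the columns that are not numeric.
categorical_columns = df.select_dtypes(exclude="number").columns.tolist()
print(categorical_columns)
```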
Missing values cause havoc in machine learning models, so let's check the missing values in each column:
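A common way to do this check is to chain `isnull()` with `sum()`. The rows below are made up just to show the pattern.

```python
import pandas as pd

# Hypothetical sample with missing values scattered across columns.
df = pd.DataFrame({
    "age": [29.0, None, 2.0, None],
    "cabin": [None, None, None, "C22"],
    "fare": [211.3, 151.5, 151.5, 26.5],
})

# isnull() marks missing cells as True; sum() adds them up per column.
missing_per_column = df.isnull().sum()
print(missing_per_column.sort_values(ascending=False))
```

Sorting the result puts the columns with the most missing values at the top.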
How to handle missing values in a column:
The check shows the number of missing values in each column, so it depends on the problem whether to eliminate them or fill them.
Let's look at some techniques for doing it:
After checking which columns have the most missing values, we need to drop those columns:
As the columns ‘body’, ‘cabin’ and ‘boat’ have the most missing values, we can drop them.
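Dropping those three columns could look like the sketch below. The sample frame is invented; only the column names ‘body’, ‘cabin’ and ‘boat’ come from the dataset.

```python
import pandas as pd

# Hypothetical sample that includes the three sparse columns.
df = pd.DataFrame({
    "survived": [1, 0, 0],
    "age": [29.0, 2.0, 30.0],
    "cabin": ["B5", None, None],
    "boat": ["2", None, None],
    "body": [None, None, 135.0],
})

# drop() returns a new frame, so assign the result back.
df = df.drop(columns=["body", "cabin", "boat"])
print(df.columns.tolist())
```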
To check whether the columns have been dropped from the data frame:
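A quick sanity check is to look at the remaining columns and the shape of the frame. This sketch repeats the drop on made-up rows so that it runs on its own.

```python
import pandas as pd

# Hypothetical sample, dropped the same way as described above.
df = pd.DataFrame({
    "survived": [1, 0, 0],
    "sex": ["female", "male", "male"],
    "age": [29.0, 2.0, 30.0],
    "cabin": ["B5", None, None],
    "boat": ["2", None, None],
    "body": [None, None, 135.0],
})
dropped = ["body", "cabin", "boat"]
trimmed = df.drop(columns=dropped)

# Any dropped name still present would show up in this list.
still_present = [name for name in dropped if name in trimmed.columns]

print(trimmed.columns.tolist())
print(trimmed.shape)
print(still_present)
```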
Filling missing values:
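Which fill value to use depends on the column. The sketch below assumes one reasonable choice, the median for a numeric column and the most frequent value for a text column; treat it as an illustration on made-up rows rather than the only right answer.

```python
import pandas as pd

# Hypothetical sample with one gap in each column.
df = pd.DataFrame({
    "age": [22.0, None, 30.0, 40.0],
    "embarked": ["S", "C", None, "S"],
})

filled = df.copy()

# Numeric column: fill with the median of the observed values.
filled["age"] = filled["age"].fillna(filled["age"].median())

# Text column: fill with the most frequent value (the mode).
most_frequent = filled["embarked"].mode().iloc[0]
filled["embarked"] = filled["embarked"].fillna(most_frequent)

print(filled)
```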
And now dropping rows with missing values:
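`dropna()` removes every row that still has at least one missing value. A minimal sketch on invented rows:

```python
import pandas as pd

# Hypothetical sample where two of the three rows have a gap.
df = pd.DataFrame({
    "age": [22.0, None, 30.0],
    "fare": [7.25, 71.28, None],
    "sex": ["male", "female", "female"],
})

# Keep only the rows that are complete in every column.
cleaned = df.dropna()
print(cleaned)
print(len(df), "rows before,", len(cleaned), "rows after")
```

Comparing the row counts before and after shows how much data the drop costs you.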
With this we have successfully eliminated missing values from our data frame.
Now comes the most important function in pandas, describe(), which gives summaries of each numerical column:
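The summary rows discussed in this post match what `DataFrame.describe()` returns, so here is a short sketch of it on made-up numbers. By default only the numeric columns are summarised.

```python
import pandas as pd

# Hypothetical sample; the text column is skipped by describe() by default.
df = pd.DataFrame({
    "age": [10.0, 20.0, 30.0, 40.0, 50.0],
    "fare": [5.0, 10.0, 15.0, 20.0, 25.0],
    "sex": ["male", "female", "male", "female", "male"],
})

# One column per numeric feature, one row per statistic.
summary = df.describe()
print(summary)
```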
We will explain it in detail, along with how to interpret it.
Let's dig deeper into each row and what it means:
- Count: number of non-NA/null observations
- Mean: average of the values
- Std: standard deviation (https://en.wikipedia.org/wiki/Standard_deviation) of the observations, i.e. how dispersed the values are
- Min: minimum value in the observations
- Max: maximum value in the observations
- Percentile: i.e. 25%, 50%, 75%; a percentile is the value below which the given percentage of the observations falls
For example, 25% of the passengers are younger than 21, 50% are younger than 28 and 75% are younger than 39.
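To see where the percentile rows come from, you can compute them directly with `quantile`. The ages below are invented, so the numbers differ from the Titanic values quoted above.

```python
import pandas as pd

# Hypothetical ages, evenly spaced to keep the arithmetic easy to follow.
ages = pd.Series([10, 20, 30, 40, 50], name="age")

# The same three cut points that the summary reports as 25%, 50% and 75%.
quartiles = ages.quantile([0.25, 0.5, 0.75])
print(quartiles)
```

The 50% value is the median, so half of the observations sit at or below it.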
Another great tool for exploratory data analysis is pandas_profiling (https://pypi.org/project/pandas-profiling/).
Let's check it out:
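A hedged sketch of generating a report. We assume the package is installed; newer releases are published as `ydata-profiling`, so the helper tries that import first and falls back to `pandas_profiling`. The helper name `build_profile_report` and the output file name are ours, for illustration only.

```python
import pandas as pd


def build_profile_report(df, output_path="titanic_report.html"):
    """Write an HTML profiling report for df and return the output path."""
    # Imported lazily because the profiling package is an optional dependency.
    # Assumption: newer releases install as ydata_profiling, older ones as pandas_profiling.
    try:
        from ydata_profiling import ProfileReport
    except ImportError:
        from pandas_profiling import ProfileReport

    report = ProfileReport(df, title="Titanic profiling report")
    report.to_file(output_path)
    return output_path


# Usage, with df being the data frame loaded earlier:
# build_profile_report(df, "titanic_report.html")
```

Opening the generated HTML file in a browser gives you per-column statistics, missing-value summaries and correlations in one place.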
You can download the report.
If you have reached this section of the post, you must be feeling confident.
Go and get your hands dirty!!
Stay tuned for our next blog on visualization.
Happy Learning and Sharing!!