This is a continuation of the blog series on getting started in data science using Python. In the last post we saw the fundamentals of using pandas.

Here we will go further into descriptive analysis of a publicly available dataset, to build some intuition about what can be done when faced with a real dataset.

Descriptive analysis is used to describe the basic features of the data. It provides summaries of the data and shows what inferences can be drawn from it.

Let's get started.

We will use the famous Titanic dataset (http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.xls) as the basis for the analysis.

First, download it, keep it in your local directory, and load it into a pandas data frame:
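A minimal sketch of the loading step, assuming the file was saved as `titanic3.xls` in the working directory. The helper name `load_titanic` is made up for this example, and reading `.xls` files needs an Excel engine such as `xlrd` to be installed.

```python
from pathlib import Path

import pandas as pd


def load_titanic(path="titanic3.xls"):
    """Read the Titanic spreadsheet into a pandas DataFrame.

    The default path is an assumption: it expects the downloaded file
    to sit in the current working directory.
    """
    return pd.read_excel(path)


# Only load when the file is actually present, so the snippet is safe to run.
if Path("titanic3.xls").exists():
    df = load_titanic()
    print(df.shape)   # (number of rows, number of columns)
    print(df.head())  # first five rows
```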

So we have successfully loaded the dataset.

Next, the usual objective with this dataset is to predict whether a passenger survived. First of all, we need to check whether the dataset is balanced or imbalanced. Many of us make the mistake of not checking this upfront.

- survival: Survival (0 = no; 1 = yes)

This is the column that holds the class, so we count how many rows fall into each class:
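One way to do this check is `value_counts()`. The sketch below assumes the column is named `survived` (as in `titanic3.xls`) and uses a few made-up rows instead of the real file so it runs on its own.

```python
import pandas as pd

# Toy rows made up for illustration; with the real data use the loaded frame.
df = pd.DataFrame({"survived": [0, 1, 0, 0, 1, 0]})

# Absolute number of rows per class.
class_counts = df["survived"].value_counts()

# Share of each class; values far from 0.5 hint at an imbalanced dataset.
class_share = df["survived"].value_counts(normalize=True)

print(class_counts)
print(class_share)
```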

So now we know the count of each class.

**Checking the count of non-missing values in each column:**
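A small sketch of this step with `DataFrame.count()`, which counts the non-missing values per column. The rows are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy rows made up for illustration.
df = pd.DataFrame({
    "survived": [1, 0, 1, 0],
    "age": [29.0, np.nan, 2.0, 30.0],
    "cabin": ["B5", None, None, "C22"],
})

# count() ignores missing values, so columns with gaps report fewer rows.
non_null_counts = df.count()
print(non_null_counts)
```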

**Checking the data type of each column:**
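The `dtypes` attribute lists the type pandas inferred for every column. A sketch on made-up rows:

```python
import pandas as pd

# Toy rows made up for illustration.
df = pd.DataFrame({
    "survived": [1, 0, 1],
    "age": [29.0, 2.0, 30.0],
    "sex": ["female", "male", "female"],
})

# One entry per column: integer, float, or a text/object type.
col_types = df.dtypes
print(col_types)
```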

**To check which columns hold categorical variables:**
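One possible approach, not necessarily the one used originally, is to treat every non-numeric column as a candidate categorical column and then look at how many distinct values it has. The rows are made up for illustration.

```python
import pandas as pd

# Toy rows made up for illustration.
df = pd.DataFrame({
    "survived": [1, 0, 1, 0],
    "name": ["A", "B", "C", "D"],
    "sex": ["female", "male", "female", "male"],
    "fare": [211.3, 151.5, 26.0, 7.9],
})

# Assumption: non-numeric columns are the categorical candidates.
categorical_cols = df.select_dtypes(exclude="number").columns

# Few distinct values (like sex) suggests a real category;
# many distinct values (like name) suggests free text or an identifier.
distinct_values = df[categorical_cols].nunique()
print(distinct_values)
```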

**Missing values cause havoc in machine learning models, so let's check the missing values in each column:**
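A sketch of the check using `isnull().sum()`, again on made-up rows that borrow a few column names from the dataset:

```python
import numpy as np
import pandas as pd

# Toy rows made up for illustration.
df = pd.DataFrame({
    "survived": [1, 0, 1, 0],
    "age": [29.0, np.nan, 2.0, 30.0],
    "cabin": ["B5", None, None, None],
    "body": [np.nan, np.nan, np.nan, 135.0],
})

# isnull() marks each missing cell as True; summing counts them per column.
missing_counts = df.isnull().sum().sort_values(ascending=False)
print(missing_counts)
```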

**How to handle missing values in a column:**

The missing-value counts show how many values are absent in each column, so whether to eliminate them or fill them depends on the problem.

Let's look at some techniques for doing it:

After checking which columns have the most missing values, we can drop those columns.

As the columns ‘body’, ‘cabin’ and ‘boat’ have the most missing values, we can drop them:
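A minimal sketch of dropping those three columns with `DataFrame.drop`. The rows are made up for illustration; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy rows made up for illustration.
df = pd.DataFrame({
    "survived": [1, 0, 1],
    "age": [29.0, 2.0, np.nan],
    "cabin": ["B5", None, None],
    "boat": ["2", None, None],
    "body": [np.nan, np.nan, 135.0],
})

# drop() returns a new frame; the original df is left unchanged.
trimmed = df.drop(columns=["body", "cabin", "boat"])
print(trimmed.head())
```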

**To check whether the columns have actually been dropped from the data frame:**
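A quick way to confirm the drop is to look at the remaining column names and the shape. A sketch on made-up rows:

```python
import pandas as pd

# Toy rows made up for illustration.
df = pd.DataFrame({
    "pclass": [1, 1, 3],
    "survived": [1, 0, 1],
    "cabin": ["B5", None, None],
    "boat": ["2", None, None],
    "body": [None, None, 135.0],
})

df = df.drop(columns=["body", "cabin", "boat"])

# The dropped names should no longer appear, and the column count shrinks.
remaining = list(df.columns)
print(remaining)
print(df.shape)
```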

**Filling missing values:**
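Which fill value to use is a modelling choice. The sketch below assumes the median for the numeric `age` column and the most frequent value for the text `embarked` column; other choices may suit your problem better. The rows are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy rows made up for illustration.
df = pd.DataFrame({
    "age": [29.0, np.nan, 2.0, 30.0],
    "embarked": ["S", "C", None, "S"],
})

# Numeric column: fill gaps with the median (an assumption, not a rule).
df["age"] = df["age"].fillna(df["age"].median())

# Text column: fill gaps with the most frequent value.
df["embarked"] = df["embarked"].fillna(df["embarked"].mode()[0])

print(df)
```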

**And now dropping the rows with missing values:**
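A sketch of this step with `dropna()`, which removes every row that still has at least one missing value. The rows are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy rows made up for illustration.
df = pd.DataFrame({
    "survived": [1, 0, 1, 0],
    "age": [29.0, np.nan, 2.0, 30.0],
    "fare": [211.3, 151.5, np.nan, 7.9],
})

# By default dropna() drops a row if any of its values is missing.
clean = df.dropna()

print(clean)
print(clean.isnull().sum())
```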

With this, we have successfully eliminated the missing values from our data frame.

Now comes one of the most important functions in pandas, which gives a summary of each numerical column: `dataframe.describe()`.
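A minimal sketch of calling `describe()`. The rows are made up for illustration, so the numbers will differ from the real dataset.

```python
import pandas as pd

# Toy rows made up for illustration.
df = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "age": [21.0, 28.0, 39.0, 50.0],
    "fare": [7.9, 26.0, 151.5, 211.3],
})

# By default only numeric columns are summarized; "name" is skipped.
summary = df.describe()
print(summary)
```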

Let's dig deeper into each row of the output and how to interpret it:

- **count**: the number of non-NA/null observations.
- **mean**: the average of the values.
- **std**: the standard deviation (https://en.wikipedia.org/wiki/Standard_deviation) of the observations, that is, how dispersed the values are.
- **min**: the minimum value in the observations.
- **max**: the maximum value in the observations.
- **percentiles** (25%, 50%, 75%): a percentile is the value below which the given percentage of the observations falls.

For example, 25% of the passengers are younger than 21, 50% are younger than 28, and 75% are younger than 39.
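The percentile rows can also be computed directly with `quantile()`. The sketch below uses made-up ages, so the values differ from the ones quoted for the real dataset.

```python
import pandas as pd

# Toy ages made up for illustration.
ages = pd.Series([10, 20, 30, 40, 50], name="age")

# The 25th, 50th and 75th percentiles, the same rows describe() reports.
quartiles = ages.quantile([0.25, 0.5, 0.75])
print(quartiles)

# The 50% row of describe() is the same value as the median.
print(ages.describe()["50%"], ages.median())
```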

Another great tool for exploratory data analysis is pandas_profiling (https://pypi.org/project/pandas-profiling/).

Let's check it out:
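A hedged sketch of generating a report, assuming the `pandas-profiling` package is installed and exposes `ProfileReport`. The helper name `build_profile_report`, the report title and the output file name are made up for this example.

```python
import pandas as pd


def build_profile_report(df, output_file="titanic_report.html"):
    """Build an HTML profiling report for df and save it to output_file.

    Assumes pandas-profiling is installed; the import lives inside the
    function so that defining it does not require the package.
    """
    from pandas_profiling import ProfileReport

    report = ProfileReport(df, title="Titanic profiling report")
    report.to_file(output_file)
    return output_file
```

With a loaded data frame `df`, calling `build_profile_report(df)` should write the HTML file into the working directory.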

You can download the report.

If you have reached this section of the post, you should be feeling confident.

Go and get your hands dirty!!

Stay tuned for the next post on visualization.

Happy Learning and Sharing!!