## Exploratory Data Analysis with a Publicly Available Data Set

This post continues the series of blog posts on getting started in data science using Python. In the last post we saw the fundamentals of using pandas.

Here we will discuss descriptive analysis of a publicly available data set, to build some intuition about what can be done when faced with a real dataset.

Descriptive analysis is used to describe the basic features of the data; it provides summaries of the data and indicates what inferences can be drawn.

Let's get started.

We will use the famous Titanic data set (http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.xls) as the basis for the analysis. Once it is loaded into a pandas data frame, we can start exploring it.
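A minimal sketch of the loading step, assuming the spreadsheet is read with pandas' read_excel. The helper name load_titanic is made up for illustration, and reading a legacy .xls file needs the optional xlrd package installed. If the link is no longer reachable, pass the path of a local copy instead.

```python
import pandas as pd

# URL quoted in the post for the titanic3 spreadsheet.
TITANIC_URL = "http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.xls"


def load_titanic(source=TITANIC_URL):
    """Read the Titanic spreadsheet into a DataFrame.

    `source` can be the URL above or the path of a downloaded copy.
    Hypothetical helper, not part of pandas.
    """
    return pd.read_excel(source)


# Usage (needs network access or a local copy of the file):
# df = load_titanic()
# print(df.head())
```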

Next, since the objective with this dataset is to predict whether a passenger survived, we first need to check whether the dataset is balanced or imbalanced. Many of us make the mistake of not checking this upfront.

• survival: Survival (0 = no; 1 = yes)

This is the column to use for checking the classes. Counting its values tells us how many passengers fall into each class.
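One way to sketch the class-balance check is with value_counts. The tiny data frame below is a made-up stand-in for the real data, and the column name survived is an assumption based on the titanic3 file; adjust it if your copy names the column differently.

```python
import pandas as pd

# Toy stand-in for the Titanic data; the values are invented for illustration.
df = pd.DataFrame({"survived": [0, 1, 0, 0, 1, 0, 1, 0]})

# Absolute number of rows in each class.
class_counts = df["survived"].value_counts()

# Share of each class; a strongly lopsided split signals an imbalanced dataset.
class_share = df["survived"].value_counts(normalize=True)

print(class_counts)
print(class_share)
```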

The next checks to run on the data frame are:

• Checking the count of each column
• Checking the data type of each column
• Checking which columns hold categorical variables
• Checking the missing values in each column, since missing values cause havoc in machine learning models

How to handle missing values in a column:

The missing-value check reports the number of missing values in each column; whether to eliminate them or fill them depends on the problem.

Let's look at some techniques for handling missing values:
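The checks listed above (non-null counts, data types, categorical columns, missing values per column) can be sketched as follows. The data frame is a small invented sample using a few titanic3-style column names, and treating every non-numeric column as categorical is a simplifying assumption.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic data; the values are invented for illustration.
df = pd.DataFrame({
    "survived": [1, 0, 1, 0, 1],
    "sex": ["female", "male", "female", "male", "male"],
    "age": [29.0, np.nan, 2.0, 30.0, np.nan],
    "cabin": ["B5", None, None, None, "E12"],
})

# Number of non-null values in each column.
non_null_counts = df.count()

# Data type of each column.
column_types = df.dtypes

# Non-numeric columns, used here as a proxy for categorical variables.
categorical_columns = df.select_dtypes(exclude=["number"]).columns.tolist()

# Number of missing values in each column.
missing_per_column = df.isnull().sum()

print(non_null_counts)
print(column_types)
print(categorical_columns)
print(missing_per_column)
```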

After checking which columns have the most missing values, we can drop those columns.

As the columns ‘body’, ‘cabin’ and ‘boat’ have the most missing values, we can drop them. We then check the data frame to confirm that the columns have actually been dropped. Next we fill missing values where that makes sense, and finally drop the rows that still have missing values. With that, we have successfully eliminated missing values from our data frame.
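A minimal sketch of these three steps on an invented sample. Filling age with the median is an assumption made for illustration; the post does not say which fill value was used, and the right choice depends on the problem.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic data; the values are invented for illustration.
df = pd.DataFrame({
    "survived": [1, 0, 1, 0, 1],
    "age": [29.0, np.nan, 2.0, 30.0, 25.0],
    "embarked": ["S", "C", None, "S", "S"],
    "cabin": ["B5", None, None, None, "E12"],
    "boat": ["2", None, "11", None, None],
    "body": [np.nan, 135.0, np.nan, np.nan, np.nan],
})

# 1. Drop the columns that are mostly empty.
df = df.drop(columns=["body", "cabin", "boat"])

# Confirm that the columns are gone.
print(df.columns.tolist())

# 2. Fill missing ages with the median age (assumed strategy).
df["age"] = df["age"].fillna(df["age"].median())

# 3. Drop any rows that still contain a missing value.
df = df.dropna()

# Every column should now report zero missing values.
print(df.isnull().sum())
```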

Now comes one of the most useful functions in pandas for getting a summary of each numerical column: dataframe.describe()

I will explain it in detail and show how to interpret it. Let's dig deeper into each row and what it means:

• count: number of non-NA/null observations
• mean: average of the values
• std: standard deviation (https://en.wikipedia.org/wiki/Standard_deviation) of the observations, i.e. how dispersed the values are
• min: minimum value in the observations
• max: maximum value in the observations
• percentiles (25%, 50%, 75%): a percentile is the value below which the given percentage of the observations fall

For example, in the age column 25% of the passengers are younger than 21, 50% are younger than 28, and 75% are younger than 39.
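The rows described above can be reproduced on a small invented sample. This is only a sketch: the numbers below are made up and will not match the Titanic figures quoted in the post.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic data; the values are invented for illustration.
df = pd.DataFrame({
    "age": [22, 38, 26, 35, np.nan, 54, 2, 27],
    "fare": [7.25, 71.28, 7.93, 53.10, 8.46, 51.86, 21.08, 11.13],
})

# count, mean, std, min, 25%, 50%, 75% and max for every numeric column.
summary = df.describe()
print(summary)

# The percentile rows can also be computed directly for a single column.
age_quartiles = df["age"].quantile([0.25, 0.5, 0.75])
print(age_quartiles)
```

Note that count for age is lower than for fare, because describe() ignores missing values.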

Another great tool for exploratory data analysis is pandas_profiling (https://pypi.org/project/pandas-profiling/).

Let's check it out.

You should be feeling confident if you have reached this section of the post:

Go and get your hands dirty!!

Stay tuned for the next post on visualization.

Happy Learning and Sharing!!