Exploratory Data Analysis with a Publicly Available Data Set

This is a continuation of the blog series on getting started in data science using Python. In the last post we saw the fundamentals of using pandas.

Here we will discuss descriptive analysis of a publicly available data set, to build some intuition about what can be done when faced with a real dataset.

Descriptive analysis is used to describe the basic features of the data. It provides summaries of the data and of what inferences can be drawn from it.

Let's get started.

We will use the famous Titanic data set (http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.xls) to set the ground for the analysis.

First download it, keep it in your local directory and load it into a pandas data frame:
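A minimal sketch of the loading step, assuming the file was saved as titanic3.xls in the working directory. The path and the helper name load_titanic are illustrative, not from the original post, and reading .xls files needs an Excel engine such as xlrd to be installed.

```python
from pathlib import Path

import pandas as pd

# Assumed local file name; adjust it to wherever you saved the download.
DATA_PATH = Path("titanic3.xls")


def load_titanic(path=DATA_PATH):
    """Read the Titanic spreadsheet into a pandas DataFrame."""
    # Legacy .xls files need an Excel engine such as xlrd installed.
    return pd.read_excel(path)


# Only load when the file is actually present, so the sketch runs anywhere.
if DATA_PATH.exists():
    df = load_titanic()
    print(df.shape)   # (number of rows, number of columns)
    print(df.head())  # first five rows
```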

So we have successfully loaded the dataset.

Next, the objective with this dataset is to predict which passengers survived. First of all, we need to check whether the dataset is balanced or imbalanced. Many of us make the mistake of not checking this upfront.

  • survival: Survival (0 = no; 1 = yes)

This is the column that holds the class label, so we count the values in it:
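As a rough sketch, the class balance can be checked with value_counts. The column name survived is assumed from the titanic3 file, and the small frame below is a hand-made stand-in rather than real passenger data.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic frame (not real passenger data).
df = pd.DataFrame({"survived": [0, 0, 0, 1, 1, 0]})

# Absolute number of rows in each class.
class_counts = df["survived"].value_counts()

# Share of each class; a strongly skewed split signals an imbalanced dataset.
class_share = df["survived"].value_counts(normalize=True)

print(class_counts)
print(class_share)
```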

So now we know the count of each class.

Checking the count of non-null values in each column:
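One way to do this, sketched on a small made-up frame, is DataFrame.count, which returns the number of non-null entries per column.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic frame (not real passenger data).
df = pd.DataFrame({
    "survived": [1, 0, 1, 0],
    "age": [29.0, np.nan, 2.0, 30.0],
    "cabin": ["B5", None, None, None],
})

# count() skips missing entries, so columns with gaps show smaller numbers.
non_null = df.count()
print(non_null)
```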

Checking the data type of each column:
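A short sketch using the dtypes attribute, again on a made-up frame with column names assumed from the titanic3 file:

```python
import pandas as pd

# Hypothetical stand-in for the Titanic frame (not real passenger data).
df = pd.DataFrame({
    "survived": [1, 0, 1],
    "age": [29.0, 2.0, 30.0],
    "sex": ["female", "male", "female"],
})

# dtypes is a Series mapping each column name to its data type.
col_types = df.dtypes
print(col_types)
```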

To check which columns hold categorical variables:
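One possible approach, shown here as a sketch, is to treat every non-numeric column as a candidate categorical column with select_dtypes. This is an assumption about what the original check looked like. Note that numerically coded columns such as pclass can also be categorical and will not be caught this way.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic frame (not real passenger data).
df = pd.DataFrame({
    "survived": [1, 0, 1],
    "age": [29.0, 2.0, 30.0],
    "sex": ["female", "male", "female"],
    "embarked": ["S", "C", "S"],
})

# Keep only the columns that are not numeric.
categorical_cols = df.select_dtypes(exclude=["number"]).columns.tolist()
print(categorical_cols)

# The number of distinct values helps confirm a column is really categorical.
print(df[categorical_cols].nunique())
```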

Missing values cause havoc in machine learning models, so let's check the number of missing values in each column:
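A minimal sketch of that check: isnull marks each missing cell, and sum adds the marks up per column. The frame is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic frame (not real passenger data).
df = pd.DataFrame({
    "age": [29.0, np.nan, 2.0, np.nan],
    "cabin": ["B5", None, None, None],
    "fare": [211.3, 151.6, 151.6, 26.0],
})

# Missing cells per column, with the worst columns listed first.
missing = df.isnull().sum().sort_values(ascending=False)
print(missing)
```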

How to handle missing values in a column:

The counts show the number of missing values in each column. Whether to eliminate them or fill them depends on the problem.

Let's look at some techniques for doing it:

After checking which columns have the most missing values, we can drop those columns:

As the columns ‘body’, ‘cabin’ and ‘boat’ have the most missing values, we can drop them.
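Dropping those three columns could look like the sketch below. The frame is a small made-up stand-in that only borrows the column names.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic frame (not real passenger data).
df = pd.DataFrame({
    "survived": [1, 0],
    "age": [29.0, 2.0],
    "cabin": ["B5", None],
    "boat": ["2", None],
    "body": [None, 135.0],
})

# drop returns a new frame, so assign the result back.
df = df.drop(columns=["body", "cabin", "boat"])
print(df.columns.tolist())
```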

To check whether the columns have actually been dropped from the data frame:
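A quick way to verify, sketched under the same assumptions, is to look at the remaining column names and test whether any dropped name is still there.

```python
import pandas as pd

# Hypothetical stand-in frame with one of the sparse columns removed.
df = pd.DataFrame({"survived": [1, 0], "age": [29.0, 2.0], "cabin": ["B5", None]})
df = df.drop(columns=["cabin"])

# Names we expect to be gone after the drop step.
dropped = ["body", "cabin", "boat"]
still_present = [col for col in dropped if col in df.columns]

print(df.columns.tolist())
print(still_present)  # an empty list means the drop worked
```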

Filling missing values:
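The post does not say which fill strategy was used, so the sketch below assumes two common ones: the median for a numeric column and the most frequent value for a text column. The data is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic frame (not real passenger data).
df = pd.DataFrame({
    "age": [22.0, np.nan, 30.0, 40.0],
    "embarked": ["S", None, "S", "C"],
})

# Numeric column: fill gaps with the median of the known values.
df["age"] = df["age"].fillna(df["age"].median())

# Text column: fill gaps with the most frequent value (the mode).
df["embarked"] = df["embarked"].fillna(df["embarked"].mode().iloc[0])

print(df)
```

The median is less sensitive to extreme values than the mean, which is why it is often a reasonable default for columns like age or fare.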

And now dropping the rows that still have missing values:
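A minimal sketch with dropna, which removes every row that has at least one missing value. The frame is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic frame (not real passenger data).
df = pd.DataFrame({
    "survived": [1, 0, 1],
    "age": [29.0, np.nan, 2.0],
    "embarked": ["S", "C", None],
})

# dropna returns a new frame; reset_index renumbers the remaining rows.
clean = df.dropna().reset_index(drop=True)

print(len(df), "rows before,", len(clean), "rows after")
```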

With this, we have successfully eliminated the missing values from our data frame.

Now comes the most important function in pandas for getting summaries of each numerical column: dataframe.describe()
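A small sketch of describe on made-up numbers, just to show the shape of the result:

```python
import numpy as np
import pandas as pd

# Hypothetical numeric columns (not real passenger data).
df = pd.DataFrame({
    "age": [20.0, 30.0, 40.0, np.nan],
    "fare": [10.0, 20.0, 30.0, 40.0],
})

# One summary column per numeric column; rows are count, mean, std,
# min, 25%, 50%, 75% and max.
summary = df.describe()
print(summary)

# Single statistics can be read with .loc[row, column].
print(summary.loc["mean", "age"])
```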

We will explain it in detail, along with how to interpret it.

Let's dig deeper into each row and what it means:

  • count: number of non-NA/null observations
  • mean: average of the values
  • std: standard deviation(https://en.wikipedia.org/wiki/Standard_deviation) of the observations, i.e. how dispersed the values are
  • min: minimum value in the observations
  • max: maximum value in the observations
  • percentiles (25%, 50%, 75%): the value below which the given percentage of the observations fall

For example, 25% of the passengers are younger than 21, 50% are younger than 28 and 75% are younger than 39 years.
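To make the percentile rows concrete, here is a sketch that computes the same three quartiles directly with quantile. The ages are invented round numbers, not values from the Titanic data.

```python
import pandas as pd

# Hypothetical ages, chosen so the quartiles are easy to follow.
ages = pd.Series([10, 20, 30, 40, 50], name="age")

# The same 25%, 50% and 75% values that describe() reports.
quartiles = ages.quantile([0.25, 0.5, 0.75])
print(quartiles)

# The 50% quartile is simply the median.
print(ages.median())
```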

Another great tool for exploratory data analysis is pandas_profiling(https://pypi.org/project/pandas-profiling/).

 

Let's check it out:
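A hedged sketch of generating a report, assuming pandas_profiling is installed. The helper name write_profile, the report title and the output file name are illustrative choices, not from the original post. Newer releases of the package are published under the name ydata-profiling.

```python
import pandas as pd

try:
    # Package named in the post; it must be installed separately.
    from pandas_profiling import ProfileReport
except ImportError:
    ProfileReport = None


def write_profile(df, out_path="titanic_report.html"):
    """Write an HTML profiling report for df and return the output path."""
    if ProfileReport is None:
        raise RuntimeError("pandas_profiling is not installed")
    report = ProfileReport(df, title="Titanic profiling report")
    report.to_file(out_path)  # saves a standalone HTML file
    return out_path
```

Calling write_profile(df) on the loaded data frame would then produce an HTML file you can open in a browser.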

 You can download the report

You must be feeling confident if you have reached this section of the post.

Go and get your hands dirty!!

Stay tuned for our next blog on visualization.

Happy Learning and Sharing !!
