Data Science is a massive field that combines many topics. There are no hard and fast rules for mastering data science, but there are key topics that, given proper attention, will help aspiring data scientists understand the field in depth.
Here we will discuss some of the fundamentals and prerequisites of data science:
Statistics forms the backbone of data science. If data science is the truck, then statistics is its fuel. We will concentrate on two aspects of statistics:
Descriptive Statistics: It is all about summarizing the data; it helps you understand the data. Some key topics in descriptive statistics are as follows:
Central Tendency: As the name suggests, central tendency represents the central point of a sample of data.
There are three measures of central tendency: mean, median and mode.
Mean is the average of all the data points = sum of all data points / number of samples.
Median is the middle value of the data sample when the values are arranged in ascending order. It is easy to find the middle value when the number of values is odd; when it is even, we take the mean of the two middle values.
Even – (1, 2, 3, 4, 4, 5) – Median = (3 + 4) / 2 = 3.5
Mode – the value that occurs with the highest frequency in the data is the mode.
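The three measures above can be checked directly with Python's built-in `statistics` module (a quick sketch on small made-up samples):

```python
from statistics import mean, median, mode

# Odd number of values: the median is the single middle element.
odd = [1, 3, 5, 7, 9]
print(mean(odd))    # 5
print(median(odd))  # 5

# Even number of values: the median is the mean of the two middle
# elements, as in the (1, 2, 3, 4, 4, 5) example above.
even = [1, 2, 3, 4, 4, 5]
print(median(even))  # (3 + 4) / 2 = 3.5
print(mode(even))    # 4 occurs most often
```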
Normal Distribution : The normal distribution is a probability function that describes how the values of a variable are distributed. It is a symmetric distribution where most of the observations cluster around the central peak and the probabilities for values further away from the mean taper off equally in both directions. Extreme values in both tails of the distribution are similarly unlikely.
Example: distribution of height in a class. Suppose that in a class of 20 students the average height is 1.5 meters, the tallest students are around 2 meters and the shortest around 1 meter. Plotting these heights gives a normal distribution, with the mean height at the centre of the curve and the standard deviations spreading out to either side.
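The height example can be simulated with NumPy; the 0.15 m standard deviation below is an assumed value, chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Draw 100,000 heights from a normal distribution with
# mean 1.5 m and standard deviation 0.15 m.
heights = rng.normal(loc=1.5, scale=0.15, size=100_000)

print(round(heights.mean(), 2))  # close to 1.5
print(round(heights.std(), 2))   # close to 0.15

# Roughly 68% of values fall within one standard deviation of the mean.
within_1sd = ((heights > 1.35) & (heights < 1.65)).mean()
print(round(within_1sd, 2))      # close to 0.68
```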
Skewness & Kurtosis: Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the center point. Skewness is used, for example, in anomaly detection.
Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution. That is, data sets with high kurtosis tend to have heavy tails, or outliers. Data sets with low kurtosis tend to have light tails, or lack of outliers. A uniform distribution would be the extreme case.
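Both measures are available in SciPy, assuming it is installed; the samples below are simulated purely to illustrate typical values:

```python
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(seed=0)

normal = rng.normal(size=100_000)
uniform = rng.uniform(size=100_000)
exponential = rng.exponential(size=100_000)

# A symmetric distribution has skewness near 0; the exponential
# distribution is right-skewed (theoretical skewness 2).
print(round(skew(normal), 1))       # ~0
print(round(skew(exponential), 1))  # ~2

# kurtosis() returns excess kurtosis: 0 for a normal distribution,
# negative for the light-tailed uniform (theoretically -1.2).
print(round(kurtosis(normal), 1))   # ~0
print(round(kurtosis(uniform), 1))  # ~-1.2
```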
Standard Deviation: https://www.mathsisfun.com/data/standard-deviation.html
The formula is easy: it is the square root of the Variance. So now you ask, “What is the Variance?”
Let's derive the standard deviation with some simple maths.
Consider a set of numbers: (-7, -6, 3, 4, 7, 8)
Mean = sum of numbers / n = (-7 + -6 + 3 + 4 + 7 + 8) / 6 = 9 / 6 = 1.5
Variance is the sum of the squared differences from the mean, divided by n:
Variance = ((-8.5)² + (-7.5)² + (1.5)² + (2.5)² + (5.5)² + (6.5)²) / 6 = 209.5 / 6 ≈ 34.92
Standard deviation is the square root of the variance: √34.92 ≈ 5.91
The higher the standard deviation, the more scattered the data are from the mean.
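We can verify the hand calculation with NumPy; note that `np.var` and `np.std` default to the population formulas used above, dividing by n:

```python
import numpy as np

data = np.array([-7, -6, 3, 4, 7, 8])

print(data.mean())           # 1.5
print(round(data.var(), 2))  # 34.92 (population variance, ddof=0)
print(round(data.std(), 2))  # 5.91
```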
Inferential Statistics is all about inferring and drawing conclusions from data: sampling data from a large population and making inferences on top of it.
Central Limit Theorem – According to the central limit theorem, if we repeatedly draw samples from a population, the distribution of the sample means approaches a normal distribution as the sample size grows, regardless of the shape of the original population. The mean of the sample means approximates the population mean, and the standard deviation of the sample means (the standard error) equals the population standard deviation divided by the square root of the sample size. So as the sample size increases, the standard error shrinks, the curve becomes more tightly normal, and we can estimate the population mean more accurately.
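A small simulation sketching the theorem, using an assumed exponential population (which is far from normal) and a sample size of 50:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Population: exponential with mean 2, decidedly non-normal.
population = rng.exponential(scale=2.0, size=1_000_000)

# Draw 10,000 samples of size 50 and record each sample's mean.
sample_means = rng.choice(population, size=(10_000, 50)).mean(axis=1)

# The mean of the sample means approximates the population mean,
print(round(sample_means.mean(), 1))

# and their spread is close to the standard error sigma / sqrt(n).
print(round(population.std() / np.sqrt(50), 2))
print(round(sample_means.std(), 2))
```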
Hypothesis testing – Hypothesis testing is a procedure for testing an assumption. It is used to infer results about a larger group or population from a hypothesis tested on a sample group. The hypothesis that we need to test is called the null hypothesis, and the hypothesis against which we test it is known as the alternate hypothesis. The null hypothesis is often the default or ideal case that we need to test.
For example, suppose we are surveying two groups of people – one group that drinks and another that does not. We assume that the mean number of liver failure patients in the drinking group is the same as in the non-drinking group. This is our null hypothesis, which we need to test in order to decide whether or not to reject it.
Conversely, our alternate hypothesis would be: the number of liver failure patients in the drinking group is much higher than in the non-drinking group, so the mean of liver failure patients in the drinking group is much higher. Based on the given data and the evidence, we test the two hypotheses and conclude whether or not to reject the null hypothesis.
The hypothesis is tested in four steps-
- We need to state both the null and the alternate hypothesis so that one of them can be rejected.
- Evaluating the data through an analysis plan.
- Compute the test statistic and analyze the sample data.
- Finally, we interpret the result and reject one of the two hypotheses.
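The four steps above can be sketched with a two-sample t-test from SciPy; the liver-enzyme scores below are simulated, hypothetical numbers used only for illustration:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(seed=7)

# Step 1: state the hypotheses. H0: the two group means are equal;
# H1: the drinking group's mean differs from the non-drinking group's.
# Simulated liver-enzyme scores (hypothetical values, for illustration).
drinking = rng.normal(loc=60, scale=10, size=50)
non_drinking = rng.normal(loc=45, scale=10, size=50)

# Steps 2 and 3: choose an analysis plan (two-sample t-test) and
# compute the test statistic on the sample data.
t_stat, p_value = ttest_ind(drinking, non_drinking)

# Step 4: interpret the result at a 5% significance level.
if p_value < 0.05:
    print("Reject the null hypothesis: the group means differ.")
else:
    print("Fail to reject the null hypothesis.")
```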
Correlation – Correlation is the relationship between two random variables. There are three types of correlation – positive correlation, negative correlation, and zero correlation. A positive correlation means that there is a relationship between two variables that makes them increase and decrease together. In a negative correlation, an increase in one variable causes a decrease in the other. In a zero correlation, there is no relation between the two variables at all.
Some common examples of correlation are – There is a positive correlation between people who eat more and obesity. Similarly, there is a negative correlation between people spending their time to exercise and their weight gain.
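These examples can be quantified with the Pearson correlation coefficient; the small arrays below are made-up illustrative values, not real measurements:

```python
import numpy as np

calories = np.array([1800, 2000, 2200, 2500, 2800, 3000])
weight   = np.array([60, 63, 67, 72, 78, 82])  # rises with calorie intake
exercise = np.array([5, 4, 4, 3, 1, 0])        # falls as weight rises

# Pearson coefficient: +1 is a perfect positive relationship,
# -1 a perfect negative one, 0 no linear relationship.
print(round(np.corrcoef(calories, weight)[0, 1], 2))  # close to +1
print(round(np.corrcoef(exercise, weight)[0, 1], 2))  # close to -1
```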
Regression – Regression is a statistical technique for estimating the relationships among variables. Regression can be simple or multiple regression, depending on the number of independent variables. Furthermore, if the function used is non-linear in nature, the type of regression is called non-linear regression.
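A minimal sketch of simple linear regression, fitting a straight line with NumPy; the experience/salary numbers are invented and follow y = 5x + 25 exactly:

```python
import numpy as np

# x: years of experience (independent variable)
# y: salary in thousands (dependent variable), lying exactly on y = 5x + 25
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([30, 35, 40, 45, 50], dtype=float)

# Fit a degree-1 polynomial, i.e. simple linear regression y = slope*x + intercept.
slope, intercept = np.polyfit(x, y, 1)
print(round(slope, 1))      # 5.0
print(round(intercept, 1))  # 25.0

# Predict the salary for 6 years of experience.
print(round(slope * 6 + intercept, 1))  # 55.0
```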
Linear Algebra :
Vector – vector is a general term with many uses. In this case, think of it as a list of values or a row in a table. The data structure is a 1-dimensional array; a vector of N elements is an N-dimensional vector, one dimension for each element.
For example, the input (3.14159, 2.71828, 1.618) is a vector of 3 elements and could be represented as a point in 3-dimensional space. Your program would declare a 1×3 array (a one-dimensional data structure) to hold the three items.
We can represent a vector in Python as a NumPy array.
A NumPy array can be created from a list of numbers. For example, below we define a vector with the length of 3 and the integer values 1, 2 and 3.
from numpy import array
v = array([1, 2, 3])
Matrix A matrix is a two-dimensional collection of numbers. We will represent matrices as lists of lists, with each inner list having the same size and representing a row of the matrix. If A is a matrix, then A[i][j] is the element in the ith row and the jth column. Per mathematical convention, we will typically use capital letters to represent matrices. For example:
Matrices are mainly used in image processing and for basic operations:
A = [[1, 2, 3],
[4, 5, 6]]
B = [[1, 2],