In statistical hypothesis testing, the p-value (sometimes called the probability value) is the probability of observing the test results, or more extreme results, assuming that the null hypothesis (H0) is true. Data science borrows many concepts from other disciplines, and the p-value is one of them: it comes from statistics and is widely used in machine learning and data science.
In statistics, our main goal is to determine the statistical significance of a result, and this rests on three concepts: hypothesis testing, the normal distribution, and the significance threshold against which the p-value is compared.
Let's understand each of them.
A hypothesis test is framed in terms of two hypotheses: the null hypothesis and the alternative hypothesis. It uses sample data to check the validity of the null hypothesis. Here, the null hypothesis (H0) states that there is no statistically significant relationship between two variables, while the alternative hypothesis states that there is one. No significant relationship means that one variable does not affect the other. Thus, the null hypothesis says that the effect you are trying to demonstrate does not actually occur. If the independent variable does affect the dependent variable, that corresponds to the alternative hypothesis.
Put simply, in hypothesis testing we first state a claim, which is taken as the null hypothesis, and then test it against sample data. The p-value is used to judge whether the evidence against this claim is statistically significant. If the evidence supports the alternative hypothesis, the null hypothesis is rejected; otherwise, we fail to reject it.
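To make the idea concrete, here is a minimal sketch of a p-value computed directly from its definition. The coin-flip experiment (100 flips, 60 heads) and the function name `binomial_upper_tail` are hypothetical and chosen only for illustration. The null hypothesis is that the coin is fair, the alternative is that it favours heads, and the p-value is the probability of seeing 60 or more heads if the null hypothesis is true.

```python
from math import comb


def binomial_upper_tail(n: int, k: int, p: float = 0.5) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p).

    Under H0 this is the one-sided p-value: the probability of a result
    at least as extreme as k successes in n trials.
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical experiment (made-up numbers): 100 flips, 60 heads.
# H0: the coin is fair (p = 0.5).  H1: the coin favours heads (p > 0.5).
p_value = binomial_upper_tail(n=100, k=60)
print(f"p-value = {p_value:.4f}")
```

The p-value here is about 0.028, so a result this extreme would be fairly rare if the coin were really fair.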
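The steps of a hypothesis-testing experiment are: state the null hypothesis; state the alternative hypothesis; choose the value of alpha; compute the z-score; and compare the resulting p-value with alpha.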
The normal distribution, also known as the Gaussian distribution, is a probability distribution. It is symmetric about the mean and is often plotted to show how data are distributed. Values near the mean occur more frequently than values far from it, which gives the distribution its bell-shaped curve. Its two parameters are the mean (μ) and the standard deviation (σ). For the standard normal distribution, the mean is 0 and the standard deviation is 1.
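These properties can be checked with a short sketch using `NormalDist` from Python's standard `statistics` module. The heights example (mean 170 cm, standard deviation 10 cm) and the helper `share_within` are assumptions made up for illustration.

```python
from statistics import NormalDist

# Hypothetical example: heights with mean 170 cm and standard deviation 10 cm.
heights = NormalDist(mu=170, sigma=10)

# Symmetry: the density is the same at equal distances on either side of the mean.
left, right = heights.pdf(160), heights.pdf(180)

# Bell shape: the density is highest at the mean itself.
peak = heights.pdf(170)


def share_within(dist: NormalDist, k: float) -> float:
    """Share of values lying within k standard deviations of the mean."""
    lower = dist.mean - k * dist.stdev
    upper = dist.mean + k * dist.stdev
    return dist.cdf(upper) - dist.cdf(lower)


shares = {k: share_within(heights, k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} standard deviation(s): {share:.4f}")
```

Roughly 68%, 95% and 99.7% of values fall within one, two and three standard deviations of the mean, whatever mean and standard deviation we pick.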
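In hypothesis testing, we need to calculate the z-score. The z-score of a data point is the number of standard deviations by which it lies from the mean: z = (x - μ) / σ, where x is the data point, μ is the mean, and σ is the standard deviation.
Here, the z-score tells us where the data point lies relative to the population average.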
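As a small sketch of this calculation, the snippet below computes z-scores for a made-up list of exam scores. The data and the function name `z_score` are hypothetical, and the list is treated as the whole population, which is why `pstdev` (the population standard deviation) is used.

```python
from statistics import mean, pstdev


def z_score(x: float, mu: float, sigma: float) -> float:
    """Number of standard deviations by which x lies from the mean mu."""
    return (x - mu) / sigma


# Hypothetical exam scores, treated here as the full population.
scores = [62, 70, 75, 80, 88]
mu = mean(scores)
sigma = pstdev(scores)

for x in scores:
    print(f"score {x}: z = {z_score(x, mu, sigma):+.2f}")
```

A positive z-score means the score is above the average, a negative one means it is below, and a score equal to the mean has a z-score of 0.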
The goal of calculating the p-value is to determine the statistical significance of the hypothesis test. To do this, we first need to set a threshold, called alpha (the significance level). Alpha should always be set before the experiment; it is commonly set to 0.05 or 0.01, depending on the type of problem.
The result is considered statistically significant if the observed p-value is lower than alpha.
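The comparison with alpha can be sketched as follows. This is one possible setup, not the only one: it assumes a two-sided one-sample z-test in which the population standard deviation is known, so the z-score is computed for a sample mean using the standard error σ/√n. The function name `two_sided_z_test` and all the numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_z_test(sample_mean: float, mu0: float, sigma: float, n: int):
    """Return (z, p_value) for a two-sided one-sample z-test.

    Assumes the population standard deviation sigma is known.
    """
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    # Probability, under H0, of a z-score at least this far from 0 in either direction.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical numbers: H0 says the population mean is 100 (sigma = 15);
# a sample of 36 observations has mean 106.
z, p_value = two_sided_z_test(sample_mean=106, mu0=100, sigma=15, n=36)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")

for alpha in (0.05, 0.01):
    verdict = "significant" if p_value < alpha else "not significant"
    print(f"alpha = {alpha}: {verdict}")
```

With these made-up numbers the p-value is about 0.016, so the result is significant at alpha = 0.05 but not at alpha = 0.01. This is why alpha has to be fixed before looking at the data.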