Comparing sample distributions with the Kolmogorov-Smirnov (KS) test

How to compare samples and understand whether they come from the same distribution using Python

Published in Towards Data Science · Feb 7, 2022

Imagine you have two sets of readings from a sensor, and you want to know if they come from the same kind of machine. How do you compare those distributions?

The quick answer is: you can use the two-sample Kolmogorov-Smirnov (KS) test, and this article will walk you through the process.

Often in statistics we need to understand whether a given sample comes from a specific distribution, most commonly the Normal (or Gaussian) distribution. For this purpose we have the so-called normality tests, such as the Shapiro-Wilk, Anderson-Darling, or Kolmogorov-Smirnov tests.

All of them measure how likely a sample is to have come from a normal distribution, with a related p-value to support this measurement.

The Kolmogorov-Smirnov test, however, goes one step further and allows us to compare two samples, telling us how likely it is that they both come from the same distribution. This test is really useful for evaluating regression and classification models, as will be explained ahead.

Put simply, we can define the KS statistic for the two-sample test as the greatest distance between the CDFs (Cumulative Distribution Functions) of the two samples.

[Figure: empirical CDFs of Sample 1 (blue) and Sample 2 (green)]

In the figure above, the blue line represents the CDF of Sample 1, $F_1(x)$, and the green line is the CDF of Sample 2, $F_2(x)$. The Kolmogorov-Smirnov statistic $D$ is given by

$$D_{n,m} = \sup_x \left| F_1(x) - F_2(x) \right|$$

with $n$ as the number of observations in Sample 1 and $m$ as the number of observations in Sample 2. We then compare the KS statistic with the corresponding KS distribution to obtain the p-value of the test.


The KS distribution for the two-sample test depends on the parameter $e_n$, which can be calculated with the expression

$$e_n = \frac{n \cdot m}{n + m}$$

All right, so far the test is quite similar to other statistical tests. But in order to calculate the KS statistic we first need to calculate the CDF of each sample.

The function cdf(sample, x) is simply the fraction of observations in the sample that are less than or equal to x. We can evaluate the CDF of any sample at a given value x with a simple algorithm:

  • Sort the sample
  • Count how many observations in the sample are less than or equal to x
  • Divide by the total number of observations in the sample

Or with the equivalent Python code:
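A minimal sketch of that function, assuming numpy is available (illustrative code, not necessarily the exact gist):

import numpy as np

def cdf(sample, x):
    # Fraction of observations in the sample that are
    # less than or equal to x (the sort makes the count easy)
    sample = np.sort(np.asarray(sample))
    return np.searchsorted(sample, x, side="right") / len(sample)

For a standard normal sample, for example, cdf(sample, 0) should return roughly 0.5.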

As I said before, the KS test is widely used for checking whether a sample is normally distributed. We can use the KS "1-sample" test to do that.

The scipy.stats library has a ks_1samp function that does this for us, but for learning purposes I will build the test from scratch. The code for this is available on my GitHub, so feel free to skip this part.

To build the ks_norm(sample) function that evaluates the KS 1-sample test for normality, we first need to calculate the KS statistic comparing the CDF of the sample with the CDF of the normal distribution (with mean = 0 and variance = 1). Then we can calculate the p-value with the KS distribution for n = len(sample), using the survival function of the KS distribution, scipy.stats.kstwo.sf [3]:
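A sketch of such a function (illustrative; the return keys match the calls below, but the original gist may differ):

import numpy as np
from scipy import stats

def ks_norm(sample):
    # Sort the sample and evaluate the standard normal CDF at each point
    sample = np.sort(np.asarray(sample))
    n = len(sample)
    normal_cdf = stats.norm.cdf(sample)
    # Empirical CDF just after and just before each observation
    ecdf_high = np.arange(1, n + 1) / n
    ecdf_low = np.arange(0, n) / n
    # KS statistic: greatest distance between empirical and normal CDFs
    ks_stat = max(np.max(ecdf_high - normal_cdf),
                  np.max(normal_cdf - ecdf_low))
    # p-value from the survival function of the KS distribution [3]
    p_value = stats.kstwo.sf(ks_stat, n)
    return {"ks_stat": ks_stat, "p_value": p_value}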

Easy as that.

Now we need some samples to test it:

import numpy as np

# Create random samples
norm_a = np.random.normal(loc=0, scale=1, size=500)
norm_b = np.random.normal(loc=0.1, scale=1, size=500)
norm_c = np.random.normal(loc=3, scale=1, size=500)
f_a = np.random.f(dfnum=5, dfden=10, size=500)
[Figure: distributions of the four generated samples]

The samples norm_a and norm_b come from a normal distribution and are really similar. The sample norm_c also comes from a normal distribution, but with a higher mean. The f_a sample comes from an F distribution.

It is important to standardize the samples before the test, or else a normal distribution with a different mean and/or variance (such as norm_c) will fail the test.
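The standardize helper is not listed in the article body; a minimal sketch (assuming a plain z-score transform) could be:

import numpy as np

def standardize(sample):
    # Center on the sample mean and scale by the sample standard
    # deviation, so the result is comparable to N(0, 1)
    sample = np.asarray(sample)
    return (sample - sample.mean()) / sample.std()

We can now perform the KS test for normality on them: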

# Performs the KS normality test in the samples
ks_norm_a = ks_norm(standardize(norm_a))
ks_norm_b = ks_norm(standardize(norm_b))
ks_norm_c = ks_norm(standardize(norm_c))
ks_f_a = ks_norm(standardize(f_a))
# Prints the result
print(f"norm_a: ks = {ks_norm_a['ks_stat']:.4f} (p-value = {ks_norm_a['p_value']:.3e}, is normal = {ks_norm_a['p_value'] > 0.05})")
print(f"norm_b: ks = {ks_norm_b['ks_stat']:.4f} (p-value = {ks_norm_b['p_value']:.3e}, is normal = {ks_norm_b['p_value'] > 0.05})")
print(f"norm_c: ks = {ks_norm_c['ks_stat']:.4f} (p-value = {ks_norm_c['p_value']:.3e}, is normal = {ks_norm_c['p_value'] > 0.05})")
print(f"f_a: ks = {ks_f_a['ks_stat']:.4f} (p-value = {ks_f_a['p_value']:.3e}, is normal = {ks_f_a['p_value'] > 0.05})")

And the results are:

norm_a: ks = 0.0252 (p-value = 9.003e-01, is normal = True)
norm_b: ks = 0.0324 (p-value = 6.574e-01, is normal = True)
norm_c: ks = 0.0333 (p-value = 6.225e-01, is normal = True)
f_a: ks = 0.1538 (p-value = 8.548e-11, is normal = False)

We compare the p-value with the significance level: if p < 0.05, we reject the null hypothesis and conclude that the sample does not come from a normal distribution, as happens with f_a. The other three samples are considered normal, as expected.

As I said before, the same result could be obtained by using the scipy.stats.ks_1samp() function:

from scipy import stats

# Evaluates the KS test against the standard normal CDF
ks_norm_a = stats.ks_1samp(x=standardize(norm_a), cdf=stats.norm.cdf)
ks_norm_b = stats.ks_1samp(x=standardize(norm_b), cdf=stats.norm.cdf)
ks_norm_c = stats.ks_1samp(x=standardize(norm_c), cdf=stats.norm.cdf)
ks_f_a = stats.ks_1samp(x=standardize(f_a), cdf=stats.norm.cdf)
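Each call returns a result object holding the statistic and the p-value, so the same report as before can be printed with, for example:

# Same report as before, using the result object's attributes
print(f"norm_a: ks = {ks_norm_a.statistic:.4f} (p-value = {ks_norm_a.pvalue:.3e})")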

The two-sample KS test allows us to compare any two given samples and check whether they came from the same distribution.

It differs from the 1-sample test in three main aspects:

  • We need to calculate the CDF for both samples
  • The KS distribution uses the parameter $e_n$, which involves the number of observations in both samples
  • We should not standardize the samples if we wish to know whether their distributions are identical

It is easy to adapt the previous code for the two-sample KS test:
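A sketch of the adapted function (the name ks_2samp_custom is mine, to avoid clashing with scipy's own ks_2samp; the original gist may differ):

import numpy as np
from scipy import stats

def ks_2samp_custom(sample1, sample2):
    # Evaluate both empirical CDFs over the pooled, sorted observations
    data = np.sort(np.concatenate([sample1, sample2]))
    cdf1 = np.searchsorted(np.sort(sample1), data, side="right") / len(sample1)
    cdf2 = np.searchsorted(np.sort(sample2), data, side="right") / len(sample2)
    # KS statistic: greatest distance between the two CDFs
    ks_stat = np.max(np.abs(cdf1 - cdf2))
    # Effective number of observations, e_n = n * m / (n + m)
    en = np.round(len(sample1) * len(sample2) / (len(sample1) + len(sample2)))
    # p-value from the survival function of the KS distribution
    p_value = stats.kstwo.sf(ks_stat, int(en))
    return {"ks_stat": ks_stat, "p_value": p_value}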

And we can evaluate all possible pairs of samples:
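For instance, with a loop over all pairs (illustrative; using itertools.combinations):

from itertools import combinations

samples = {"norm_a": norm_a, "norm_b": norm_b, "norm_c": norm_c, "f_a": f_a}
# Run the custom two-sample test on every pair of samples
for (name1, s1), (name2, s2) in combinations(samples.items(), 2):
    result = ks_2samp_custom(s1, s2)
    print(f"{name1} vs {name2}: ks = {result['ks_stat']:.4f} "
          f"(p-value = {result['p_value']:.3e}, "
          f"are equal = {result['p_value'] > 0.05})")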

The output is:

norm_a vs norm_b: ks = 0.0680 (p-value = 1.891e-01, are equal = True)
norm_a vs norm_c: ks = 0.8640 (p-value = 1.169e-216, are equal = False)
norm_a vs f_a: ks = 0.5720 (p-value = 6.293e-78, are equal = False)
norm_b vs norm_c: ks = 0.8680 (p-value = 5.772e-220, are equal = False)
norm_b vs f_a: ks = 0.5160 (p-value = 2.293e-62, are equal = False)
norm_c vs f_a: ks = 0.6580 (p-value = 1.128e-106, are equal = False)

As expected, only samples norm_a and norm_b can be considered to come from the same distribution at a 5% significance level. We cannot consider the distributions of any of the other pairs to be equal.

I have detailed the KS test for didactic purposes, but both tests can easily be performed by using the scipy module in Python.

The single-sample (normality) test can be performed by using the scipy.stats.ks_1samp function, and the two-sample test can be done with the scipy.stats.ks_2samp function. Check it out!
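For example, for two of the samples above (a direct call with default arguments):

from scipy import stats

# Two-sample KS test straight from scipy
result = stats.ks_2samp(norm_a, norm_b)
print(f"ks = {result.statistic:.4f}, p-value = {result.pvalue:.3e}")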

Now you have a new tool to compare distributions. The KS test is really useful, and since it is included in scipy, it is also easy to use.

The KS test is also quite useful for evaluating classification models, and I will write a future article showing how we can do that.


And read this related post:

Evaluating classification models with Kolmogorov-Smirnov (KS) test: Using the KS test to evaluate the separation between class distributions (towardsdatascience.com)

[1] SciPy API Reference. scipy.stats.ks_2samp.

[2] SciPy API Reference. scipy.stats.ks_1samp.

[3] SciPy API Reference. scipy.stats.kstwo.
