Saturday, December 17, 2016

Spark Apply Descriptive Statistics on DataFrame

When you first get your data you have to play with it.
You want to learn what kind of data you have.
Below is a simple piece of code to start investigating the general properties of your data.

Suppose you have data like 48, 49, 50, 51, 52. This is evenly spread, homogeneous data.

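import org.apache.commons.math3.stat.descriptive._
// spark-shell already imports spark.implicits._, which toDF needs

val df = Seq(48.0, 49.0, 50.0, 51.0, 52.0).toDF("nums")

// Collect the column to the driver as plain doubles (fine for small data only)
val nums = df.select("nums").rdd.map(row => row.getDouble(0)).collect()

val arrMean = new DescriptiveStatistics()
nums.foreach(v => arrMean.addValue(v))

// toString prints n, min, max, mean, std dev, median, skewness and kurtosis
println(arrMean)

val meanQ1 = arrMean.getPercentile(25)
val meanQ3 = arrMean.getPercentile(75)
val meanIQR = meanQ3 - meanQ1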





Perfect distribution
48,49,50,51,52
n: 5 
min: 48.0 
max: 52.0 
mean: 50.0 
std dev: 1.5811388300841898 
median: 50.0 
skewness: 0.0 
kurtosis: -1.200000000000002 
meanQ1: Double = 48.5 
meanQ3: Double = 51.5 
meanIQR: Double = 3.0
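
The quartiles above can be reproduced by hand. Below is a minimal sketch, assuming the position rule p(n+1)/100 with linear interpolation between the two neighbouring values. That rule matches the numbers printed in this post, but treat it as an illustration rather than the library's actual code. The names PercentileSketch and percentile are made up for this example.

```scala
object PercentileSketch {
  // Estimate the p-th percentile (0 < p <= 100) of a non-empty sample.
  def percentile(values: Seq[Double], p: Double): Double = {
    require(values.nonEmpty && p > 0 && p <= 100)
    val sorted = values.sorted
    val n = sorted.length
    val pos = p * (n + 1) / 100 // 1-based position, possibly fractional
    if (pos < 1) sorted.head
    else if (pos >= n) sorted.last
    else {
      val lowerIdx = math.floor(pos).toInt // 1-based index of the lower neighbour
      val fraction = pos - lowerIdx        // how far we are towards the upper neighbour
      val lower = sorted(lowerIdx - 1)
      val upper = sorted(lowerIdx)
      lower + fraction * (upper - lower)
    }
  }
}
```

For 48, 49, 50, 51, 52 the 25th percentile sits at position 1.5, halfway between 48 and 49, which gives 48.5. The 75th sits at position 4.5, which gives 51.5.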



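Let's form a line-shaped distribution, where every value is the same. The standard deviation is 0, so skewness and kurtosis, which divide by it, come out as NaN.
val df = Seq(50.0, 50.0, 50.0, 50.0, 50.0).toDF("nums")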

n: 5 
min: 50.0 
max: 50.0 
mean: 50.0 
std dev: 0.0 
median: 50.0 
skewness: NaN 
kurtosis: NaN 
meanQ1: Double = 50.0 
meanQ3: Double = 50.0 
meanIQR: Double = 0.0

Let's add 40 to make it left skewed (negative skew).
** Skewness is the asymmetry of a distribution about its mean.



Left tail (skew) distribution

val df = Seq(40.0, 48.0, 49.0, 50.0, 51.0, 52.0).toDF("nums")
n: 6 
min: 40.0 
max: 52.0 
mean: 48.333333333333336 
std dev: 4.320493798938574 
median: 49.5 
skewness: -1.8805720776629977 
kurtosis: 3.9187500000000064 
meanQ1: Double = 46.0 
meanQ3: Double = 51.25 
meanIQR: Double = 5.25


Right tail (skew) distribution
If we just add 60 to the original series, we get a right-tailed distribution.
The skewness has the same magnitude but the opposite sign.
val df = Seq(48.0, 49.0, 50.0, 51.0, 52.0, 60.0).toDF("nums")

n: 6 
min: 48.0 
max: 60.0 
mean: 51.666666666666664 
std dev: 4.320493798938574 
median: 50.5 
skewness: 1.8805720776629975 
kurtosis: 3.9187500000000064

meanQ1: Double = 48.75 
meanQ3: Double = 54.0 
meanIQR: Double = 5.25

meanIQR is the interquartile range: the spread of the middle half of the data, with the extremes on both sides cut off. So it tells you a lot if you know your domain.
For example, say you have car price data and you know a car should cost around $50,000.
The values between meanQ1 and meanQ3 will be close to your expectation. The others will
be either meaninglessly high (unrealistic expectation of the seller) or low (this time meaningful, because
the car could be damaged). meanIQR is a nice measure.
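
One common way to turn the IQR into a concrete check is Tukey's fences: flag anything more than 1.5 x IQR below Q1 or above Q3. This is a different technique from just looking at Q1 and Q3, shown here as a sketch only. The object name, function names and car prices below are all invented for illustration, and the percentile rule is the same assumed p(n+1)/100 interpolation that reproduces the quartiles in this post.

```scala
object IqrFences {
  // Percentile by linear interpolation at position p(n+1)/100 (assumed rule).
  def percentile(values: Seq[Double], p: Double): Double = {
    val sorted = values.sorted
    val n = sorted.length
    val pos = p * (n + 1) / 100
    if (pos < 1) sorted.head
    else if (pos >= n) sorted.last
    else {
      val i = math.floor(pos).toInt
      sorted(i - 1) + (pos - i) * (sorted(i) - sorted(i - 1))
    }
  }

  // Values outside [Q1 - k*IQR, Q3 + k*IQR], in their original order.
  def outliers(values: Seq[Double], k: Double = 1.5): Seq[Double] = {
    val q1 = percentile(values, 25)
    val q3 = percentile(values, 75)
    val iqr = q3 - q1
    val low = q1 - k * iqr
    val high = q3 + k * iqr
    values.filter(v => v < low || v > high)
  }
}

// Hypothetical car prices: one suspiciously cheap, one unrealistically expensive
val prices = Seq(5000.0, 45000.0, 48000.0, 50000.0, 52000.0, 55000.0, 250000.0)
val suspicious = IqrFences.outliers(prices)
```

With these made-up prices Q1 is 45000 and Q3 is 55000, so the fences are 30000 and 70000, and only 5000 and 250000 fall outside.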

Skewness gives a rough idea of which way the data leans: a negative value means the tail is on the left, a positive value means it is on the right.
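
To see where the sign comes from, here is a small sketch of a bias-corrected sample skewness, written from the textbook formula n / ((n-1)(n-2)) * sum(((x - mean) / s)^3). It reproduces the skewness values printed in this post, but it is an illustration under that assumption, not the library's code. SkewnessSketch is a made-up name.

```scala
object SkewnessSketch {
  // Bias-corrected sample skewness; needs at least 3 values.
  def skewness(values: Seq[Double]): Double = {
    val n = values.length.toDouble
    require(n >= 3)
    val mean = values.sum / n
    val variance = values.map(v => math.pow(v - mean, 2)).sum / (n - 1)
    val stdDev = math.sqrt(variance)
    // Cubing keeps the sign, so values far below the mean pull the sum negative
    val cubed = values.map(v => math.pow((v - mean) / stdDev, 3)).sum
    n / ((n - 1) * (n - 2)) * cubed
  }
}
```

The single low value 40 contributes a large negative cube, which is why the left-tailed series comes out negative. When every value is the same, stdDev is 0 and the division yields NaN.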

Kurtosis is a measure of shape. The sharper the peak, the higher the kurtosis. Please look up a picture online to see the difference.
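
The kurtosis numbers in this post can be reproduced with the usual bias-corrected sample formula, which subtracts a correction term so that a normal-looking sample scores roughly 0. The sketch below assumes that formula. KurtosisSketch is a made-up name, and this is an illustration rather than the library's implementation.

```scala
object KurtosisSketch {
  // Bias-corrected sample (excess) kurtosis; needs at least 4 values.
  def kurtosis(values: Seq[Double]): Double = {
    val n = values.length.toDouble
    require(n >= 4)
    val mean = values.sum / n
    val variance = values.map(v => math.pow(v - mean, 2)).sum / (n - 1)
    // Fourth powers make far-away values dominate the sum
    val fourth = values.map(v => math.pow(v - mean, 4)).sum / (variance * variance)
    val scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    val correction = 3 * math.pow(n - 1, 2) / ((n - 2) * (n - 3))
    scale * fourth - correction
  }
}
```

For the flat series 48 to 52 this gives about -1.2, and for the series with the extra 40 it gives about 3.92, in line with the output shown earlier.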












