You want to learn what kind of data you have.
Below is a short piece of code to start investigating the general properties of your data.
Suppose you have data like 48, 49, 50, 51, 52. This is evenly distributed, homogeneous data.
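import org.apache.commons.math3.stat.descriptive._
import spark.implicits._ // needed for toDF; already in scope in spark-shell

val df = Seq(48.0, 49.0, 50.0, 51.0, 52.0).toDF("nums")
// Despite its name, `mean` holds the raw column values collected to the driver
val mean = df.select("nums").rdd.map(row => row.getDouble(0)).collect()
val arrMean = new DescriptiveStatistics()
mean.foreach(v => arrMean.addValue(v))
println(arrMean) // prints n, min, max, mean, std dev, median, skewness, kurtosis
val meanQ1 = arrMean.getPercentile(25)
val meanQ3 = arrMean.getPercentile(75)
val meanIQR = meanQ3 - meanQ1 // interquartile range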
Perfect distribution
48,49,50,51,52
n: 5
min: 48.0
max: 52.0
mean: 50.0
std dev: 1.5811388300841898
median: 50.0
skewness: 0.0
kurtosis: -1.200000000000002
meanQ1: Double = 48.5
meanQ3: Double = 51.5
meanIQR: Double = 3.0
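Let's form a flat, line-shaped distribution where every value is the same. The standard deviation is then 0, so skewness and kurtosis come out as NaN.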
val df = Seq(50,50, 50, 50,50.0).toDF("nums")
n: 5
min: 50.0
max: 50.0
mean: 50.0
std dev: 0.0
median: 50.0
skewness: NaN
kurtosis: NaN
meanQ1: Double = 50.0
meanQ3: Double = 50.0
meanIQR: Double = 0.0
Let's add 40 to create a left skew (negative skew).
** Skewness is the asymmetry of a distribution about its mean.
Left tail (skew) distribution
val df = Seq(40,48,49, 50, 51,52.0).toDF("nums")
n: 6
min: 40.0
max: 52.0
mean: 48.333333333333336
std dev: 4.320493798938574
median: 49.5
skewness: -1.8805720776629977
kurtosis: 3.9187500000000064
meanQ1: Double = 46.0
meanQ3: Double = 51.25
meanIQR: Double = 5.25
Right tail (skew) distribution
If we instead add 60 to the original series, we get a right-tail distribution.
The skewness has the same magnitude but the opposite sign.
val df = Seq(48,49, 50, 51,52.0,60).toDF("nums")
n: 6
min: 48.0
max: 60.0
mean: 51.666666666666664
std dev: 4.320493798938574
median: 50.5
skewness: 1.8805720776629975
kurtosis: 3.9187500000000064
meanQ1: Double = 48.75
meanQ3: Double = 54.0
meanIQR: Double = 5.25
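The skewness values in the two outputs above can be reproduced by hand. The following is a minimal sketch in plain Scala (no Spark needed), assuming the bias-corrected sample skewness formula that Commons Math documents for its Skewness statistic. The function name `skewness` and the variable names are chosen here only for illustration.

```scala
// Sample skewness: n / ((n-1)(n-2)) * sum((x - mean)^3) / s^3,
// where s is the sample standard deviation (divides by n - 1).
def skewness(xs: Seq[Double]): Double = {
  val n = xs.length.toDouble
  val mean = xs.sum / n
  val s = math.sqrt(xs.map(x => math.pow(x - mean, 2)).sum / (n - 1))
  val cubed = xs.map(x => math.pow(x - mean, 3)).sum
  n / ((n - 1) * (n - 2)) * cubed / math.pow(s, 3)
}

val leftTail  = Seq(40.0, 48.0, 49.0, 50.0, 51.0, 52.0)
val rightTail = Seq(48.0, 49.0, 50.0, 51.0, 52.0, 60.0)
val flat      = Seq(50.0, 50.0, 50.0, 50.0, 50.0)

println(skewness(leftTail))   // negative: tail on the left
println(skewness(rightTail))  // positive: tail on the right
println(skewness(flat))       // NaN: 0 / 0 because the std dev is 0
```

The two skewed series are mirror images of each other around 50, which is why only the sign changes.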
meanIQR is the interquartile range (Q3 - Q1): the spread of the middle 50% of the data, so the extreme values at either end do not affect it. It tells you a lot if you know your domain.
For example, say you have car price data and you know a car should cost around $50,000.
The prices between meanQ1 and meanQ3 will be close to your expectation. The others will
be either unreasonably high (an unrealistic expectation from the seller) or low (which can be meaningful, because
the car could be damaged). meanIQR is a nice measure.
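To make the car price idea concrete, here is a small sketch in plain Scala. The prices are made up for illustration. The `percentile` helper is a hypothetical re-implementation of the (n + 1) position rule, which reproduces the quartiles shown in the outputs above. The 1.5 x IQR cut-off is a common rule of thumb, not something the original code uses.

```scala
// Percentile by linear interpolation at position p * (n + 1) / 100 (1-based).
def percentile(data: Seq[Double], p: Double): Double = {
  val sorted = data.sorted.toVector
  val n = sorted.length
  val pos = p * (n + 1) / 100.0
  if (pos < 1) sorted.head
  else if (pos >= n) sorted.last
  else {
    val i = math.floor(pos).toInt
    sorted(i - 1) + (pos - i) * (sorted(i) - sorted(i - 1))
  }
}

// Made-up car prices: most are near 50,000, one is very low, one very high.
val prices = Seq(5000.0, 46000.0, 48000.0, 50000.0, 51000.0, 53000.0, 120000.0)

val q1 = percentile(prices, 25)
val q3 = percentile(prices, 75)
val iqr = q3 - q1

// Assumed rule of thumb: flag anything more than 1.5 * IQR outside the quartiles.
val (typical, suspicious) =
  prices.partition(p => p >= q1 - 1.5 * iqr && p <= q3 + 1.5 * iqr)

println(s"Q1=$q1 Q3=$q3 IQR=$iqr")
println(s"typical: $typical")
println(s"suspicious: $suspicious")
```

With these numbers the very low and very high prices land in `suspicious`, while the quartiles themselves stay close to the expected price.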
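Skewness gives a rough idea of which way the data leans: a negative value means the tail is on the left, a positive value means it is on the right.
Kurtosis is a measure of shape. The sharper the peak and the heavier the tails, the higher the kurtosis. Pictures comparing kurtosis values are easy to find online.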
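The kurtosis numbers in the outputs above can be checked by hand. This is a minimal sketch in plain Scala, assuming the bias-corrected sample excess kurtosis formula that Commons Math documents for its Kurtosis statistic. The name `kurtosis` and the variable names are chosen here only for illustration.

```scala
// Sample excess kurtosis:
//   n(n+1) / ((n-1)(n-2)(n-3)) * sum((x - mean)^4) / s^4  -  3(n-1)^2 / ((n-2)(n-3))
// where s is the sample standard deviation (divides by n - 1).
def kurtosis(xs: Seq[Double]): Double = {
  val n = xs.length.toDouble
  val mean = xs.sum / n
  val variance = xs.map(x => math.pow(x - mean, 2)).sum / (n - 1)
  val fourth = xs.map(x => math.pow(x - mean, 4)).sum
  val scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
  val correction = 3 * math.pow(n - 1, 2) / ((n - 2) * (n - 3))
  scale * fourth / (variance * variance) - correction
}

val even = Seq(48.0, 49.0, 50.0, 51.0, 52.0)
val withOutlier = Seq(40.0, 48.0, 49.0, 50.0, 51.0, 52.0)

println(kurtosis(even))         // negative: flatter than a normal curve
println(kurtosis(withOutlier))  // positive: one far-away value fattens the tail
```

Because this is excess kurtosis, a normal distribution would score around 0, so the sign alone already says whether the data is flatter or more peaked than that.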