Wednesday, January 18, 2017

Impute Outliers In Spark

Imputing data in R is easy:

fun <- function(x){
    # cap values below the 5th percentile and above the 95th percentile
    quantiles <- quantile( x, c(.05, .95) )
    x[ x < quantiles[1] ] <- quantiles[1]
    x[ x > quantiles[2] ] <- quantiles[2]
    x
}
fun( yourdata )
I tried to write similar functions for Spark. Below you will find methods for replacing outlier values, where
Q1 is the 1st quartile,
Q3 is the 3rd quartile, and
IQR = Q3 - Q1:
left outliers: x < Q1 - 1.5×IQR
right outliers: x > Q3 + 1.5×IQR
For a sample dataset I obtained the following values:
org.apache.spark.util.StatCounter = (count: 13, mean: 131.538462, stdev: 65.441242, max: 320.000000, min: 1.000000) 
rddMin: Double = 1.0 
rddMax: Double = 320.0 
rddMean: Double = 131.53846153846155 
quantiles: Array[Double] = Array(117.0, 150.0) 
Q1: Double = 117.0 
Q3: Double = 150.0 
IQR: Double = 33.0 
lowerRange: Double = 67.5 
upperRange: Double = 199.5
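Where the fences come from: once Q1 and Q3 are known, the rest is arithmetic. A minimal plain-Scala sketch (no Spark needed), using the values and the 13-point sample above:

```scala
// Q1 and Q3 as reported above; the fences follow from the IQR rule
val Q1 = 117.0
val Q3 = 150.0
val IQR = Q3 - Q1                 // 33.0
val lowerRange = Q1 - 1.5 * IQR   // 67.5
val upperRange = Q3 + 1.5 * IQR   // 199.5

// the 13-point sample behind the StatCounter output above
val data = Seq(1.0, 110.0, 111.0, 112.0, 117.0, 118.0, 120.0,
               122.0, 129.0, 140.0, 150.0, 160.0, 320.0)

val leftOutliers  = data.filter(_ < lowerRange)  // Seq(1.0)
val rightOutliers = data.filter(_ > upperRange)  // Seq(320.0)
```

So only 1.0 falls below the lower fence and only 320.0 exceeds the upper fence; those are the values the functions below will replace.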
So according to our data set and this logic we can make the changes below (of course there are more options):
left outliers -> Q1
left outliers -> mean
right outliers -> Q3
right outliers -> mean
You may also want to impute values less than Q1 itself, not just those below the lower fence.
import org.apache.spark.sql.functions._

// replace values below a threshold with a target value
val imputeLessToTarget: (Double, Double, Double) => Double =
  (arg: Double, threshold: Double, target: Double) => if (arg < threshold) target else arg
val imputeLessToTargetUDF = udf(imputeLessToTarget)

// replace values above a threshold with a target value
val imputeMoreToTarget: (Double, Double, Double) => Double =
  (arg: Double, threshold: Double, target: Double) => if (arg > threshold) target else arg
val imputeMoreToTargetUDF = udf(imputeMoreToTarget)
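Stripped of the udf wrapper, these are plain scalar functions, so their logic can be sanity-checked without a SparkSession:

```scala
// same function bodies as above, minus the UDF registration
val imputeLessToTarget: (Double, Double, Double) => Double =
  (arg, threshold, target) => if (arg < threshold) target else arg
val imputeMoreToTarget: (Double, Double, Double) => Double =
  (arg, threshold, target) => if (arg > threshold) target else arg

// 1.0 is below the lower fence (67.5), so it is pulled up to Q1 (117.0)
val low  = imputeLessToTarget(1.0, 67.5, 117.0)    // 117.0
// 320.0 is above the upper fence (199.5), so it is pulled down to Q3 (150.0)
val high = imputeMoreToTarget(320.0, 199.5, 150.0) // 150.0
// in-range values pass through unchanged
val mid  = imputeMoreToTarget(140.0, 199.5, 150.0) // 140.0
```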


//Change values less than lowerRange of outliers to Q1 **** Capping
val df5 = df.withColumn("replaced", imputeLessToTargetUDF(col("original"), lit(lowerRange), lit(Q1)))
println("df5 Change values less than lowerRange of outliers to Q1 **** Capping")
df5.show()
//Change values less than lowerRange of outliers to mean
val df6 = df.withColumn("replaced", imputeLessToTargetUDF(col("original"), lit(lowerRange), lit(rddMean)))
println("df6 Change values less than lowerRange of outliers to mean")
df6.show()
//Change values less than Q1 to mean
val df7 = df.withColumn("replaced", imputeLessToTargetUDF(col("original"), lit(Q1), lit(rddMean)))
println("df7 Change values less than Q1 to mean")
df7.show()
//Change values more than upperRange of outliers to Q3 **** Capping
val df8 = df.withColumn("replaced", imputeMoreToTargetUDF(col("original"), lit(upperRange), lit(Q3)))
println("df8 Change values more than upperRange of outliers to Q3 **** Capping")
df8.show()
//Change values more than upperRange of outliers to mean
val df9 = df.withColumn("replaced", imputeMoreToTargetUDF(col("original"), lit(upperRange), lit(rddMean)))
println("df9 Change values more than upperRange of outliers to mean")
df9.show()
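The two directions can also be combined into a single clip, the Spark-side analogue of the R function at the top (capping at the quartiles instead of the 5th/95th percentiles). A plain-Scala sketch of that combined capping, using the fences and quartiles above:

```scala
val (q1, q3) = (117.0, 150.0)
val (lowerRange, upperRange) = (67.5, 199.5)

// cap left outliers to Q1 and right outliers to Q3, like the R function
def cap(x: Double): Double =
  if (x < lowerRange) q1
  else if (x > upperRange) q3
  else x

val capped = Seq(1.0, 110.0, 160.0, 320.0).map(cap)
// Seq(117.0, 110.0, 160.0, 150.0)
```

In Spark the same effect comes from chaining the two UDF calls (or two when/otherwise expressions) on the same column.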
Below you can see the output of the generated columns (original | replaced). These types of functions are my toolset for dealing with new data.
df5 Change values less than lowerRange of outliers to Q1 **** Capping 
1.0| 117.0
110.0| 110.0
111.0| 111.0
112.0| 112.0
117.0| 117.0
118.0| 118.0
120.0| 120.0
122.0| 122.0
129.0| 129.0
140.0| 140.0
150.0| 150.0
160.0| 160.0
320.0| 320.0
df6 Change values less than lowerRange of outliers to mean 
1.0|131.53846153846155
110.0| 110.0
111.0| 111.0
112.0| 112.0
117.0| 117.0
118.0| 118.0
120.0| 120.0
122.0| 122.0
129.0| 129.0
140.0| 140.0
150.0| 150.0
160.0| 160.0
320.0| 320.0
df7 Change values less than Q1 to mean 
1.0|131.53846153846155
110.0|131.53846153846155
111.0|131.53846153846155
112.0|131.53846153846155
117.0| 117.0
118.0| 118.0
120.0| 120.0
122.0| 122.0
129.0| 129.0
140.0| 140.0
150.0| 150.0
160.0| 160.0
320.0| 320.0
df8 Change values more than upperRange of outliers to Q3 **** Capping
1.0| 1.0
110.0| 110.0
111.0| 111.0
112.0| 112.0
117.0| 117.0
118.0| 118.0
120.0| 120.0
122.0| 122.0
129.0| 129.0
140.0| 140.0
150.0| 150.0
160.0| 160.0
320.0| 150.0
df9 Change values more than upperRange of outliers to mean 
1.0| 1.0
110.0| 110.0
111.0| 111.0
112.0| 112.0
117.0| 117.0
118.0| 118.0
120.0| 120.0
122.0| 122.0
129.0| 129.0
140.0| 140.0
150.0| 150.0
160.0| 160.0
320.0|131.53846153846155
