fun <- function(x){ quantiles <- quantile( x, c(.05, .95 ) ) x[ x < quantiles[1] ] <- quantiles[1] x[ x > quantiles[2] ] <- quantiles[2] x } fun( yourdata )I tried to write my functions for Spark. Below you will find methods for replacing values
Q1 is 1st quantile and
Q3 is 3rd quantile
left outliers < Q1 – 1.5×IQR
right outliers > Q3 + 1.5×IQR
For a dataset I obtain below values for these results.
org.apache.spark.util.StatCounter = (count: 13, mean: 131.538462, stdev: 65.441242, max: 320.000000, min: 1.000000) rddMin: Double = 1.0 rddMax: Double = 320.0 rddMean: Double = 131.53846153846155 quantiles: Array[Double] = Array(117.0, 150.0) Q1: Double = 117.0 Q3: Double = 150.0 IQR: Double = 33.0 lowerRange: Double = 67.5 upperRange: Double = 199.5So according to our data set and our logic we can make below changes.(Of course there are more)
left outliers -> Q1
left outliers -> mean
right outliers -> Q3
right outliers -> mean
You can also want to impute values less than Q1 to Q1.
import org.apache.spark.sql.functions._ val imputeLessToTarget: (Double,Double,Double ) => Double = (arg: Double,treshold: Double,target: Double) => { if (arg < treshold) target else arg} val imputeLessToTargetUDF = udf(imputeLessToTarget) val imputeMoreToTarget: (Double,Double,Double ) => Double = (arg: Double,treshold: Double,target: Double) => { if (arg > treshold) target else arg} val imputeMoreToTargetUDF = udf(imputeMoreToTarget) //Change values less than lowerRange of outliers to Q1 **** Capping val df5 = df.withColumn("replaced", imputeLessToTargetUDF(col("original") ,lit(lowerRange),lit(Q1) )) println( "df5 Change values less than lowerRange of outliers to Q1 **** Capping") df5.show() //Change values less than lowerRange of outliers to mean val df6 = df.withColumn("replaced", imputeLessToTargetUDF(col("original") ,lit(lowerRange),lit(rddMean) )) println( "df6 Change values less than lowerRange of outliers to mean") df6.show() //Change values less than Q1 to mean val df7 = df.withColumn("replaced", imputeLessToTargetUDF(col("original") ,lit(Q1),lit(rddMean) )) println( "df7 Change values less than Q1 to mean") df7.show() //Change values more than upperRange of outliers to Q3 **** Capping val df8 = df.withColumn("replaced", imputeMoreToTargetUDF(col("original") ,lit(upperRange),lit(Q3) )) println( "df8 Change values more than upperRange of outliers to Q3 **** Capping") df8.show() //Change values more than upperRange of outliers to mean val df9 = df.withColumn("replaced", imputeMoreToTargetUDF(col("original") ,lit(upperRange),lit(rddMean) )) println( "df9 Change values more than upperRange of outliers to mean") df9.show()At below you can see output of generating new columns. These type of functions are my toolset for dealing with new data.
df5 Change values less than lowerRange of outliers to Q1 **** Capping 1.0| 117.0 110.0| 110.0 111.0| 111.0 112.0| 112.0 117.0| 117.0 118.0| 118.0 120.0| 120.0 122.0| 122.0 129.0| 129.0 140.0| 140.0 150.0| 150.0 160.0| 160.0 320.0| 320.0 df6 Change values less than lowerRange of outliers to mean 1.0|131.53846153846155 110.0| 110.0 111.0| 111.0 112.0| 112.0 117.0| 117.0 118.0| 118.0 120.0| 120.0 122.0| 122.0 129.0| 129.0 140.0| 140.0 150.0| 150.0 160.0| 160.0 320.0| 320.0 df7 Change values less than Q1 to mean 1.0|131.53846153846155 110.0|131.53846153846155 111.0|131.53846153846155 112.0|131.53846153846155 117.0| 117.0 118.0| 118.0 120.0| 120.0 122.0| 122.0 129.0| 129.0 140.0| 140.0 150.0| 150.0 160.0| 160.0 320.0| 320.0 df8 Change values more than upperRange of outliers to Q3 **** Capping 1.0| 150.0 110.0| 150.0 111.0| 150.0 112.0| 150.0 117.0| 150.0 118.0| 150.0 120.0| 150.0 122.0| 150.0 129.0| 150.0 140.0| 150.0 150.0| 150.0 160.0| 150.0 320.0| 320.0 df9 Change values more than upperRange of outliers to mean 1.0|131.53846153846155 110.0|131.53846153846155 111.0|131.53846153846155 112.0|131.53846153846155 117.0|131.53846153846155 118.0|131.53846153846155 120.0|131.53846153846155 122.0|131.53846153846155 129.0|131.53846153846155 140.0|131.53846153846155 150.0|131.53846153846155 160.0|131.53846153846155 320.0| 320.0
No comments:
Post a Comment