Monday, January 9, 2017

R: Impute a Data Frame (Replace Outliers)

There are several methods on the internet for imputing outliers.
Here is a sample I use, which combines methods I found.
My intention is to apply the imputation selectively to numeric columns.
When
Q1 is the first quartile,
Q3 is the third quartile and
IQR = Q3 - Q1 is the interquartile range,
values in the ranges below are treated as outliers by the usual 1.5×IQR rule:
below Q1 - 1.5×IQR or above Q3 + 1.5×IQR
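
As a quick illustration of the rule, the two bounds can be computed by hand for a small vector. This is only a sketch: the vector x and the names lower and upper are chosen here for the example, and quantile() is used with R's default method (type 7).

```r
# Example vector (illustrative values)
x <- c(1, 3, 40, 50, 600)

q1  <- quantile(x, 0.25, names = FALSE)  # first quartile: 3
q3  <- quantile(x, 0.75, names = FALSE)  # third quartile: 50
iqr <- q3 - q1                           # interquartile range: 47

lower <- q1 - 1.5 * iqr                  # -67.5
upper <- q3 + 1.5 * iqr                  # 120.5

# Values outside [lower, upper] are flagged as outliers
outliers <- x[x < lower | x > upper]
outliers                                 # 600
```

Only 600 falls outside the bounds, so it is the single value that would be replaced.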


The code below replaces every value x that is below the lower bound (min) or above the upper bound (max) with the column mean. Depending on your
data, the median could be a better choice.

numcol <- c(1,3,40,50,600)
numcol2 <- c(2,420,400,500,600)
charcol <- c("a","a","b","b","a")


df <- data.frame(a=numcol,b=charcol,c=numcol2)
#select numeric columns to change
columnsToChange <- c("a","c")

df
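for(i in columnsToChange){
  # Q1 is the 25% quantile, Q3 the 75% quantile
  Q1 <- quantile(df[,i],0.25, na.rm=TRUE)
  Q3 <- quantile(df[,i],0.75, na.rm=TRUE)
  iqr <- IQR(df[,i], na.rm=TRUE)

  # outlier bounds; named lower/upper so base::min and base::max are not masked
  lower <- Q1 - (iqr * 1.5 )
  upper <- Q3 + (iqr * 1.5 )

  # the mean is taken over the whole column, outliers included
  colmean <- mean(df[,i], na.rm=TRUE)

  message(sprintf("min ,  max  mean  %s %s mean of column %s \n", lower, upper, colmean))

  indexesstochange <- which(df[,i] < lower | df[,i] > upper)

  # collapse so that several indexes still print as a single message
  message(sprintf("indexes to change %s \n", paste(indexesstochange, collapse=", ")))

  df[,i][indexesstochange] <- colmean
}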
df
It produces the output below. In column a the outlier is the maximum value, at index 5. In column c the outlier is the minimum value, at index 1.
    a b   c
1   1 a   2
2   3 a 420
3  40 b 400
4  50 b 500
5 600 a 600

min ,  max  mean  -67.5 120.5 mean of column 138.8 

indexes to change 5 

min ,  max  mean  250 650 mean of column 384.4 

indexes to change 1 


      a   b     c
1   1.0 a 384.4
2   3.0 a 420.0
3  40.0 b 400.0
4  50.0 b 500.0
5 138.8 a 600.0
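
The same loop could also be wrapped in a small helper so that the replacement statistic is easy to swap, for example to try the median instead of the mean. This is only a sketch under a few assumptions: impute_outliers and its cols and fill arguments are names made up here and not part of any package, fill must be a function that accepts na.rm (as mean and median do), and by default every numeric column is processed. Like the loop above, it computes the fill value from the whole column, outliers included.

```r
# Hypothetical helper: replace 1.5×IQR-rule outliers in the chosen columns.
# `fill` reduces a numeric vector to a single replacement value.
impute_outliers <- function(df, cols = names(df)[sapply(df, is.numeric)], fill = mean) {
  for (col in cols) {
    x <- df[[col]]
    q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
    iqr <- q[2] - q[1]
    lower <- q[1] - 1.5 * iqr
    upper <- q[2] + 1.5 * iqr
    idx <- which(x < lower | x > upper)   # which() ignores NA comparisons
    df[[col]][idx] <- fill(x, na.rm = TRUE)
  }
  df
}

# Same sample data as in the post
df <- data.frame(a = c(1, 3, 40, 50, 600),
                 b = c("a", "a", "b", "b", "a"),
                 c = c(2, 420, 400, 500, 600))

res_median <- impute_outliers(df, fill = median)  # median-based replacement
res_mean   <- impute_outliers(df)                 # mean-based, as in the loop above
res_median
```

With fill = median, the 600 in column a becomes 40 and the 2 in column c becomes 420, so the replacement is no longer pulled toward the outlier itself.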
