Here is a sample I use, which combines several methods I found.
My intention is to apply imputation selectively to numeric columns.
When
Q1 is the 1st quartile,
Q3 is the 3rd quartile and
IQR = Q3 - Q1 is the interquartile range,
values in the ranges below are outliers by the usual 1.5×IQR rule:
below Q1 – 1.5×IQR or above Q3 + 1.5×IQR
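As a quick check of this rule, the two bounds can be computed directly in R. This is only a minimal sketch: the vector x and the names lower and upper are illustrative, not part of any library.

```r
# Illustrative vector (an assumption for this sketch)
x <- c(1, 3, 40, 50, 600)

q1  <- quantile(x, 0.25, names = FALSE)  # first quartile
q3  <- quantile(x, 0.75, names = FALSE)  # third quartile
iqr <- q3 - q1                           # same value as IQR(x)

lower <- q1 - 1.5 * iqr                  # values below this are outliers
upper <- q3 + 1.5 * iqr                  # values above this are outliers

# Values outside [lower, upper] are flagged as outliers
x[x < lower | x > upper]
```

With R's default quantile type, only the value 600 falls outside the bounds for this vector.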
The code below replaces every value x that is below the lower bound (min) or above the upper bound (max)
with the column mean. Depending on your data, the median could be a better choice.
numcol <- c(1,3,40,50,600)
numcol2 <- c(2,420,400,500,600)
charcol <- c("a","a","b","b","a")
df <- data.frame(a=numcol,b=charcol,c=numcol2)
#select numeric columns to change
columnsToChange <- c("a","c")
df
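for (i in columnsToChange) {
  # Q1 is the 25% quantile, Q3 is the 75% quantile
  Q1 <- quantile(df[, i], 0.25, na.rm = TRUE)
  Q3 <- quantile(df[, i], 0.75, na.rm = TRUE)
  iqr <- IQR(df[, i], na.rm = TRUE)
  # named lower/upper rather than min/max, to avoid masking the base functions
  lower <- Q1 - 1.5 * iqr
  upper <- Q3 + 1.5 * iqr
  colMean <- mean(df[, i], na.rm = TRUE)
  message(sprintf("min , max mean %s %s mean of column %s \n", lower, upper, colMean))
  indexesToChange <- which(df[, i] < lower | df[, i] > upper)
  message(sprintf("indexes to change %s \n", paste(indexesToChange, collapse = ", ")))
  # replace the outliers with the column mean
  df[indexesToChange, i] <- colMean
}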
df
It produces the output below. For column a the outlier is the maximum value, at index 5.
For column c the outlier is the minimum value, at index 1.
    a b   c
1   1 a   2
2   3 a 420
3  40 b 400
4  50 b 500
5 600 a 600
min , max mean -67.5 120.5 mean of column 138.8
indexes to change 5
min , max mean 250 650 mean of column 384.4
indexes to change 1
      a b     c
1   1.0 a 384.4
2   3.0 a 420.0
3  40.0 b 400.0
4  50.0 b 500.0
5 138.8 a 600.0
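If the median suits your data better, the same idea could be wrapped in a small function. This is a sketch under assumptions, not a definitive implementation: replace_outliers and its fill argument are hypothetical names introduced here, and the replacement value is computed over the whole column, as in the loop above.

```r
# Hypothetical helper: replace 1.5*IQR outliers in one numeric vector.
# `fill` is the summary function used for the replacement value
# (median by default; pass mean to mimic the loop above).
replace_outliers <- function(x, fill = median) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  lower <- q[1] - 1.5 * iqr
  upper <- q[2] + 1.5 * iqr
  # NA values are never treated as outliers
  is_outlier <- !is.na(x) & (x < lower | x > upper)
  # replacement value is taken over the whole column, outliers included
  x[is_outlier] <- fill(x, na.rm = TRUE)
  x
}

df <- data.frame(a = c(1, 3, 40, 50, 600),
                 b = c("a", "a", "b", "b", "a"),
                 c = c(2, 420, 400, 500, 600))
columnsToChange <- c("a", "c")

# apply the helper to the selected numeric columns only
df[columnsToChange] <- lapply(df[columnsToChange], replace_outliers)
df
```

With the sample data, the outlier in column a (index 5) becomes 40 and the outlier in column c (index 1) becomes 420, the medians of the respective columns.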