Here is a sample I use, which combines several methods I found.
My intention is to apply imputation selectively to numeric columns.
Let Q1 be the first quartile, Q3 the third quartile, and IQR = Q3 - Q1 the interquartile range.
By the usual 1.5×IQR rule, a value is treated as an outlier when it falls
below Q1 - 1.5×IQR or above Q3 + 1.5×IQR.
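The code below replaces every value x with x < min or x > max by the column mean, where min and max are the two limits above. Note that the mean is computed over the whole column, outliers included. Depending on your
data, the median could be a better choice.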
    numcol <- c(1, 3, 40, 50, 600)
    numcol2 <- c(2, 420, 400, 500, 600)
    charcol <- c("a", "a", "b", "b", "a")
    df <- data.frame(a = numcol, b = charcol, c = numcol2)

    # select numeric columns to change
    columnsToChange <- c("a", "c")
    df

    for (i in columnsToChange) {
      # Q1 is the 25th percentile, Q3 is the 75th percentile
      Q1 <- quantile(df[, i], 0.25, na.rm = TRUE)
      Q3 <- quantile(df[, i], 0.75, na.rm = TRUE)
      # limits: anything below min or above max counts as an outlier
      min <- Q1 - (IQR(df[, i], na.rm = TRUE) * 1.5)
      max <- Q3 + (IQR(df[, i], na.rm = TRUE) * 1.5)
      # column mean, computed once before any value is replaced
      colMean <- mean(df[, i], na.rm = TRUE)
      message(sprintf("min , max mean %s %s mean of column %s \n", min, max, colMean))
      indexesstochange <- which(df[, i] < min | df[, i] > max)
      # collapse so that several indexes are printed on one line
      message(sprintf("indexes to change %s \n", paste(indexesstochange, collapse = ", ")))
      df[, i][indexesstochange] <- colMean
    }
    df

It produces the output below. For column a, the outlier is the maximum value, at index 5. For column c, the outlier is the minimum value, at index 1.
        a b   c
    1   1 a   2
    2   3 a 420
    3  40 b 400
    4  50 b 500
    5 600 a 600
    min , max mean -67.5 120.5 mean of column 138.8
    indexes to change 5
    min , max mean 250 650 mean of column 384.4
    indexes to change 1
          a b     c
    1   1.0 a 384.4
    2   3.0 a 420.0
    3  40.0 b 400.0
    4  50.0 b 500.0
    5 138.8 a 600.0
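Since the median can be a better replacement value than the mean for some data, here is a minimal sketch of that variant on the same sample data. The helper name impute_outliers_median is my own invention for illustration, not something from a package, and the 1.5×IQR limits are the same assumption as above.

```r
# Hypothetical helper (name is illustrative only): replace values outside
# the 1.5 * IQR limits with the median of the column.
impute_outliers_median <- function(x) {
  q1 <- quantile(x, 0.25, na.rm = TRUE, names = FALSE)
  q3 <- quantile(x, 0.75, na.rm = TRUE, names = FALSE)
  iqr <- q3 - q1
  lower <- q1 - 1.5 * iqr
  upper <- q3 + 1.5 * iqr
  # NA values are never flagged as outliers
  is_outlier <- !is.na(x) & (x < lower | x > upper)
  x[is_outlier] <- median(x, na.rm = TRUE)
  x
}

df <- data.frame(
  a = c(1, 3, 40, 50, 600),
  b = c("a", "a", "b", "b", "a"),
  c = c(2, 420, 400, 500, 600)
)

# apply only to the selected numeric columns
columnsToChange <- c("a", "c")
df[columnsToChange] <- lapply(df[columnsToChange], impute_outliers_median)
df
```

With this data the 600 in column a becomes 40 and the 2 in column c becomes 420. Unlike the mean, the median is barely affected by the outlier itself.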