Sunday, October 2, 2016

Mean Imputation for Missing Data

Sometimes we receive a dataset with lots of null cells to analyze. For example, I have a car dataset with this schema:

root
|-- product: string (nullable = true)
|-- price: integer (nullable = true)
|-- km: double (nullable = true)
|-- color: string (nullable = true)
|-- fuel: string (nullable = true)
|-- city: string (nullable = true)
|-- gear: string (nullable = true)
|-- year: integer (nullable = true)

Most people do not enter km. If I remove the rows with a null km, I am left with too little data.
So, reluctantly, I use mean imputation, which is in fact not a very good thing to do, because variances
and regression results will be affected.



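averagekm = dataset.agg({"km": "avg"}).first()[0]  # mean of the non-null km values
training = dataset.withColumn("label", dataset['km']*1.0).na.fill(averagekm, subset=["label"])  # na.fill fills nulls; na.replace(0, ...) only swaps zeros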
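
To see why I say the variance is affected, here is a small sketch in plain Python (no Spark needed). The km values are made up for illustration and are not from my car data. Every imputed row sits exactly on the mean, so it adds nothing to the spread while still counting as a row, and the sample variance shrinks.

```python
import statistics

# Hypothetical km readings (made up); None marks a listing with km left blank.
km = [42000.0, None, 88000.0, None, 15000.0, 120000.0, None, 61000.0]

# Mean of the values that were actually entered.
observed = [v for v in km if v is not None]
mean_km = statistics.mean(observed)

# Mean imputation: every missing value becomes the observed mean.
imputed = [mean_km if v is None else v for v in km]

# Sample variance before and after imputation.
var_observed = statistics.variance(observed)
var_imputed = statistics.variance(imputed)

print(f"mean after imputation:       {statistics.mean(imputed):.1f}")
print(f"variance of observed values: {var_observed:.1f}")
print(f"variance after imputation:   {var_imputed:.1f}")
```

The mean stays the same, but the variance after imputation is smaller by the factor (observed rows - 1) / (all rows - 1). The more nulls there are, the more the imputed column understates the real spread, and anything built on that variance (regression standard errors, for example) will look better than it should.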
