Sometimes the data we receive for analysis has lots of null cells. For example, I have a car dataset with this schema:
root
|-- product: string (nullable = true)
|-- price: integer (nullable = true)
|-- km: double (nullable = true)
|-- color: string (nullable = true)
|-- fuel: string (nullable = true)
|-- city: string (nullable = true)
|-- gear: string (nullable = true)
|-- year: integer (nullable = true)
Most people do not enter km. If I remove the rows with null km, I end up with a much smaller dataset.
So, reluctantly, I use mean imputation, which is not really a good thing, because variances and regression
results will be affected. Here the missing km values are replaced with the average km:
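averagekm = dataset.agg({"km": "avg"}).first()[0]  # avg ignores the null km values
training = dataset.withColumn("label", dataset["km"] * 1.0).na.fill({"label": averagekm})  # fill nulls in label only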
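To see why this affects the variance, here is a small sketch in plain Python, without Spark. The km values are made up for illustration and are not from the car data; None stands for a listing where km was left blank. Under that assumption, the mean stays the same after imputation, but the variance shrinks, because every filled-in value sits exactly on the mean and adds no spread.

```python
import statistics

# Hypothetical km readings (not from the real dataset); None = km left blank.
km = [12000.0, None, 85000.0, None, 40000.0, None, 150000.0, 63000.0]

# Keep only the values that were actually entered.
observed = [v for v in km if v is not None]
mean_km = statistics.mean(observed)

# Mean imputation: every missing value becomes the mean of the observed ones.
imputed = [mean_km if v is None else v for v in km]

# Sample variance before and after imputation.
var_observed = statistics.variance(observed)
var_imputed = statistics.variance(imputed)

print("mean before:", mean_km, "mean after:", statistics.mean(imputed))
print("variance before:", var_observed)
print("variance after: ", var_imputed)
print("ratio after/before:", var_imputed / var_observed)
```

The imputed rows add nothing to the sum of squared deviations but still increase the row count, so the more nulls are filled in, the more the variance is understated. Any regression that uses km afterwards sees a column that looks less spread out than it really is.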