Thursday, December 22, 2016

ChiSqSelector for Top Feature Selection

When I get new data, I first try to understand which columns are most important and how the columns relate to each other.
The relation between columns is captured by a correlation matrix, which I will show with a sample in another post.

Suppose you are guessing whether someone is a woman or not based on
foot, chest, height, and hair measurements.
You do not know which parameter affects the result most.
You run ChiSqSelector to find the n most important parameters.
setNumTopFeatures controls the value of n.

import org.apache.spark.ml.feature.ChiSqSelector
import org.apache.spark.ml.linalg.Vectors

val data = Seq(
  // (id, features: foot, chest, height, hair, label: 1.0 = woman, 0.0 = not)
  (7, Vectors.dense(36.0, 90.0, 165.0, 20.0), 1.0),
  (8, Vectors.dense(38.0, 95.0, 170.0, 25.0), 1.0),
  (8, Vectors.dense(41.0, 60.0, 178.0, 10.0), 0.0),
  (9, Vectors.dense(42.0, 60.0, 165.0, 5.1), 0.0)
)

// Outside spark-shell, createDataset needs `import spark.implicits._` for the encoder.
val df = spark.createDataset(data).toDF("id", "features", "clicked")

val selector = new ChiSqSelector()
  .setNumTopFeatures(2)
  .setFeaturesCol("features")
  .setLabelCol("clicked")
  .setOutputCol("selectedFeatures")

val result = selector.fit(df).transform(df)
result.show()

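The result is shown below.
The selectedFeatures column holds the chosen features.
Since setNumTopFeatures is 2, each selectedFeatures vector has 2 values.
According to ChiSqSelector, the two most important features are foot and chest.
Note that ChiSqSelector treats each feature as categorical, so with only four rows this ranking is an illustration rather than a reliable result.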

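+---+--------------------+-------+----------------+
| id|            features|clicked|selectedFeatures|
+---+--------------------+-------+----------------+
|  7|[36.0,90.0,165.0,...|    1.0|     [36.0,90.0]|
|  8|[38.0,95.0,170.0,...|    1.0|     [38.0,95.0]|
|  8|[41.0,60.0,178.0,...|    0.0|     [41.0,60.0]|
|  9|[42.0,60.0,165.0,...|    0.0|     [42.0,60.0]|
+---+--------------------+-------+----------------+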
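The table only shows the selected values, so you have to match them against the input by eye. The fitted model also exposes the indices of the kept features through selectedFeatures, which can be mapped back to names. Below is a minimal sketch of that idea. The featureNames array and the local SparkSession setup are my own assumptions for illustration, not part of the original example; in spark-shell the session already exists as spark.

```scala
import org.apache.spark.ml.feature.ChiSqSelector
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

// Assumption: a local session for a standalone run; spark-shell already provides `spark`.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("ChiSqSelectorIndices")
  .getOrCreate()
import spark.implicits._

// Hypothetical name list, in the same order as the values inside each feature vector.
val featureNames = Array("foot", "chest", "height", "hair")

val df = Seq(
  (7, Vectors.dense(36.0, 90.0, 165.0, 20.0), 1.0),
  (8, Vectors.dense(38.0, 95.0, 170.0, 25.0), 1.0),
  (8, Vectors.dense(41.0, 60.0, 178.0, 10.0), 0.0),
  (9, Vectors.dense(42.0, 60.0, 165.0, 5.1), 0.0)
).toDF("id", "features", "clicked")

// Fit once and keep the model so its metadata can be inspected.
val model = new ChiSqSelector()
  .setNumTopFeatures(2)
  .setFeaturesCol("features")
  .setLabelCol("clicked")
  .setOutputCol("selectedFeatures")
  .fit(df)

// selectedFeatures holds the positions of the kept features within the input vector.
val selectedNames = model.selectedFeatures.map(i => featureNames(i))
println(selectedNames.mkString(", "))
```

For the data above, the printed names should correspond to the values in the selectedFeatures column of the table. Keeping the model in its own variable, instead of chaining fit and transform, is what makes the indices available.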
