The relationship between columns is described by a correlation coefficient matrix, which I will show with a sample in another post.
Suppose you are trying to guess whether someone is a woman based on
foot, chest, height, and hair measurements.
You do not know which parameter has the most effect.
You can run ChiSqSelector to find the n most important parameters.
setNumTopFeatures sets n.
import org.apache.spark.ml.feature.ChiSqSelector
import org.apache.spark.ml.linalg.Vectors
import spark.implicits._ // already in scope in spark-shell

val data = Seq(
  // (id, Vectors.dense(foot, chest, height, hair), label: 1.0 = woman)
  (7, Vectors.dense(36.0, 90.0, 165.0, 20.0), 1.0),
  (8, Vectors.dense(38.0, 95.0, 170.0, 25.0), 1.0),
  (8, Vectors.dense(41.0, 60.0, 178.0, 10.0), 0.0),
  (9, Vectors.dense(42.0, 60.0, 165.0, 5.1), 0.0)
)

// The label column is named "clicked" here; it holds the woman / not-woman flag.
val df = spark.createDataset(data).toDF("id", "features", "clicked")

val selector = new ChiSqSelector()
  .setNumTopFeatures(2)
  .setFeaturesCol("features")
  .setLabelCol("clicked")
  .setOutputCol("selectedFeatures")

val result = selector.fit(df).transform(df)
result.show()
The result is shown below.
Look at the selectedFeatures column.
Since setNumTopFeatures is 2, each selectedFeatures vector holds 2 values.
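According to ChiSqSelector, the two most important columns are foot and chest. The output vector keeps them in their original column order, so it does not by itself show which of the two ranked higher.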
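+---+--------------------+-------+----------------+
| id|            features|clicked|selectedFeatures|
+---+--------------------+-------+----------------+
|  7|[36.0,90.0,165.0,...|    1.0|     [36.0,90.0]|
|  8|[38.0,95.0,170.0,...|    1.0|     [38.0,95.0]|
|  8|[41.0,60.0,178.0,...|    0.0|     [41.0,60.0]|
|  9|[42.0,60.0,165.0,...|    0.0|     [42.0,60.0]|
+---+--------------------+-------+----------------+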
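To double-check which columns were picked, and to see the score behind each one, something like the sketch below may help. It reads the chosen indices from the fitted model's selectedFeatures and runs ChiSquareTest on the same data to print a p-value per feature (a lower p-value means a stronger relation to the label). The featureNames array and the local SparkSession settings are my own assumptions for illustration and are not part of the example above. Keep in mind that the chi-square test treats every distinct value as a category, so with only four rows of continuous measurements the numbers are only a rough hint.

```scala
import org.apache.spark.ml.feature.ChiSqSelector
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.stat.ChiSquareTest
import org.apache.spark.sql.SparkSession

// Local session for this sketch (assumed settings); in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("ChiSqSelectorCheck")
  .getOrCreate()
import spark.implicits._

// Assumed helper: names for the vector positions, in the order used in the data.
val featureNames = Array("foot", "chest", "height", "hair")

// Same rows as in the example above.
val data = Seq(
  (7, Vectors.dense(36.0, 90.0, 165.0, 20.0), 1.0),
  (8, Vectors.dense(38.0, 95.0, 170.0, 25.0), 1.0),
  (8, Vectors.dense(41.0, 60.0, 178.0, 10.0), 0.0),
  (9, Vectors.dense(42.0, 60.0, 165.0, 5.1), 0.0)
)
val df = spark.createDataset(data).toDF("id", "features", "clicked")

// Fit the selector and keep the model so its chosen indices can be inspected.
val model = new ChiSqSelector()
  .setNumTopFeatures(2)
  .setFeaturesCol("features")
  .setLabelCol("clicked")
  .setOutputCol("selectedFeatures")
  .fit(df)

// Map the selected indices back to names; sorted so the order is by column position.
val selectedNames = model.selectedFeatures.sorted.map(i => featureNames(i))
println("Selected: " + selectedNames.mkString(", "))

// ChiSquareTest returns a single row with pValues, degreesOfFreedom and statistics.
val testRow = ChiSquareTest.test(df, "features", "clicked").head
val pValues = testRow.getAs[Vector]("pValues").toArray

// Print one p-value per feature name.
featureNames.zip(pValues).foreach { case (name, p) =>
  println(f"$name%-7s p-value = $p%.4f")
}
```

Comparing the printed p-values is what tells you the actual ranking; the selectedFeatures column in the result only tells you which columns made the cut.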