***
The official Spark samples work on big files with thousands of lines.
You also do not know the data, so you cannot really play with it.
Here I use the simplest possible data set for Spark MLlib, so that you can play with it and understand which parameters affect which metrics.
It is not for experts, but it is perfect for beginners who need to calibrate parameters on a simple set.
***
In the samples on the internet, people usually try to guess whether a mail is spam or not.
The code below combines pieces from the official Spark samples and some other examples.
I tried to make it work with Spark 2 without success; it works with 1.6.
Because it did not work at first I experimented a lot and took many fixes from the net, so the code is not neat.
Let's make the problem much simpler: I will list some properties of an object and try to guess whether it is a plane or not.
My training set is
"wing wheel engine" : 1 (it is a plane)
"wheel airbag engine" : 0 (it is not a plane)
Steps
1)Get training set
2)Tokenize it
3)Apply hashingtf
Result of HashingTF: it generates one sparse term-frequency vector per sentence, so two vectors here.
0 wheel airbag engine ["wheel","airbag","engine"] {"type":0,"size":20,"indices":[3,14,18],"values":[1,1,1]}
1 wing wheel engine ["wing","wheel","engine"] {"type":0,"size":20,"indices":[3,7,14],"values":[1,1,1]}
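If you wonder which bucket each word falls into, you can inspect the hashing directly. A minimal sketch, assuming Spark 1.6 (where ml.feature.HashingTF uses the same hash function as mllib.feature.HashingTF; exact indices may differ in other versions):

import org.apache.spark.mllib.feature.{HashingTF => MllibHashingTF}

val htf = new MllibHashingTF(20)  // same numFeatures as in the pipeline below
Seq("wing", "wheel", "engine", "airbag", "airport").foreach { w =>
  println(s"$w -> bucket ${htf.indexOf(w)}")  // in the run above: wing -> 7, wheel -> 3, airbag -> 18
}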
4)Train model
Result of training
Array[org.apache.spark.mllib.regression.LabeledPoint] = Array(
(8.0,[0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0]),
(9.0,[0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0])
)
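Besides the labeled points above, you can also look inside the trained model itself. A small sketch, assuming the model val from the full listing at the end of the post (NaiveBayesModel exposes its class labels, log priors and log conditional probabilities):

println(model.labels.mkString(", "))                      // class labels, e.g. 0.0, 1.0
println(model.pi.mkString(", "))                          // log prior probability per class
model.theta.foreach(row => println(row.mkString(", ")))   // log P(word bucket | class), one row per class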
5)Prepare test data
(0,"wing airbag")
(1,"wing airport")
(0,"wing airport") False negative(this will be guest as plane, but it is zeppelin!!)
hashingtf generates below vectors for test data
[8,wing airbag,WrappedArray(wing, airbag),(20,[7,18],[1.0,1.0])],
[9,wing airport,WrappedArray(wing, airport),(20,[3,7],[1.0,1.0])])
Index 7 was "wing" and index 3 was "wheel" in the model vectors, so "airport" hashes to the same bucket as "wheel" and the model effectively sees "wing wheel".
6)Apply the prediction; testpredictionAndLabel holds (prediction, label) pairs:
(0.0,0.0) guessed as not a plane, actually not a plane
(1.0,1.0) guessed as a plane, actually a plane
(1.0,0.0) guessed as a plane, actually not a plane (the zeppelin)
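If you want to see which sentence got which score instead of only the (prediction, label) pairs, you can keep the raw text next to the prediction. A small sketch, assuming testFeatureData and model from the full listing at the end:

testFeatureData.select("text", "features").map {
  case Row(text: String, features: Vector) => (text, model.predict(features))
}.collect().foreach(println)
// e.g. (wing airport,1.0) -> the zeppelin line is scored as a plane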
7)Dump the metrics; the output is as below:
Confusion matrix:
1.0 1.0
0.0 1.0
Precision(0.0) = 1.0
Precision(1.0) = 0.5
Recall(0.0) = 0.5
Recall(1.0) = 1.0
FPR(0.0) = 0.0
FPR(1.0) = 0.5
F1-Score(0.0) = 0.6666666666666666
F1-Score(1.0) = 0.6666666666666666
Weighted precision: 0.8333333333333333
Weighted recall: 0.6666666666666666
Weighted F1 score: 0.6666666666666666
Weighted false positive rate: 0.16666666666666666
labels: Array[Double] = Array(0.0, 1.0)
Precision(0.0) = 1.0 : we predicted 1 zero (not plane) and it was correct, so the ratio is 1 / 1 = 1
Precision(1.0) = 0.5 : we predicted 2 ones (plane), 1 was correct and 1 was not, so the ratio is 1 / 2 = 0.5
Recall(0.0) = 0.5 : we correctly found 1 zero, but there were in fact 2 zeros, so 1 / 2 = 0.5
Recall(1.0) = 1.0 : we correctly found 1 one, and there was in fact only 1 one, so 1 / 1 = 1
F1-Score(0.0) = 0.6666666666666666
F1-Score = 2 x (precision x recall) / (precision + recall)
         = 2 x (0.5 x 1) / (0.5 + 1) = 1 / 1.5 = 0.6666...
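To double check these numbers, the same per-label arithmetic can be done by hand in a few lines of Scala (this is just a hand calculation, not part of the Spark pipeline; the counts come from the confusion matrix above):

def prf(tp: Double, fp: Double, fn: Double): (Double, Double, Double) = {
  val precision = tp / (tp + fp)
  val recall    = tp / (tp + fn)
  val f1        = 2 * precision * recall / (precision + recall)
  (precision, recall, f1)
}
println(prf(tp = 1, fp = 0, fn = 1))  // label 0.0 -> (1.0, 0.5, 0.666...)
println(prf(tp = 1, fp = 1, fn = 0))  // label 1.0 -> (0.5, 1.0, 0.666...)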
From the definitions:
Precision can be seen as a measure of exactness or quality, whereas recall is a measure of completeness or quantity.
High precision means that an algorithm returned substantially more relevant results than irrelevant ones, while high recall means that an algorithm returned most of the relevant results.
What do these mean?
Imagine that in our sample we have a bigger set and our model says
there are 30 planes, but only 20 of those 30 really are planes.
Then precision is 20 / 30; it measures how well we performed on the results we returned.
But there are also items we missed.
Suppose there were in fact 50 planes in total.
Then recall = 20 / 50 = 0.4.
It is the percentage of the real planes that we actually returned.
High precision, low recall: we are very good when we do make a positive call, but we do not cover the whole space.
It usually means we chose the cut-off value too high.
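The same arithmetic for the hypothetical numbers above, just as a sanity check (20 correct out of 30 predicted planes, 50 real planes in total):

val predicted = 30.0
val correct   = 20.0
val real      = 50.0
println(correct / predicted)  // precision = 0.666...
println(correct / real)       // recall    = 0.4

The full listing that produced the outputs above (run on Spark 1.6) is below.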
import org.apache.spark.ml.feature.{RegexTokenizer, Tokenizer}
import org.apache.spark.ml.feature.{HashingTF, IDF}
import org.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel}
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.sql.Row  // needed for the Row pattern matches below
// training set: category 1 = plane, 0 = not plane
val trainData = sqlContext.createDataFrame(Seq((0,"wheel airbag engine"),(1,"wing wheel engine"))).toDF("category","text")
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val wordsData = tokenizer.transform(trainData)
// hash each word list into a sparse term-frequency vector of size 20
val hashTF = new HashingTF().setInputCol("words").setOutputCol("features").setNumFeatures(20)
val featureData = hashTF.transform(wordsData)
// keep only the label and the features, and cast the label to double
val subFeature = featureData.select("category","features")
val df_1 = subFeature.withColumnRenamed("category","category2")
val trainDataRdd2 = df_1.withColumn("category",df_1.col("category2").cast("double")).drop("category2")
trainDataRdd2.printSchema()
// convert to an RDD[LabeledPoint] and train a multinomial Naive Bayes model
val trainLabeledPoints = trainDataRdd2.select("category","features").map { case Row(l: Double, p: Vector) => LabeledPoint(l, p) }
val model = NaiveBayes.train(trainLabeledPoints, lambda = 1.0, modelType = "multinomial")
//same for the test data
val testData = sqlContext.createDataFrame(Seq((0,"wing airbag"),(1,"wing airport"),(0,"wing airport"))).toDF("category","text")
val testWordData = tokenizer.transform(testData)
val testFeatureData = hashTF.transform(testWordData)
// convert the test rows to LabeledPoints as well
val testDataRdd = testFeatureData.select("category","features").map {
  case Row(label: Int, features: Vector) =>
    LabeledPoint(label.toDouble, Vectors.dense(features.toArray))
}
val testpredictionAndLabel = testDataRdd.map(p => (model.predict(p.features), p.label))
val metrics = new MulticlassMetrics(testpredictionAndLabel)
/* output F1-measure for all labels (0 and 1, negative and positive) */
metrics.labels.foreach( l => println(metrics.fMeasure(l)))
testpredictionAndLabel.take(5)
// Confusion matrix
println("Confusion matrix:")
println(metrics.confusionMatrix)
// Precision by label
val labels = metrics.labels
labels.foreach { l =>
println(s"Precision($l) = " + metrics.precision(l))
}
// Recall by label
labels.foreach { l =>
println(s"Recall($l) = " + metrics.recall(l))
}
// False positive rate by label
labels.foreach { l =>
println(s"FPR($l) = " + metrics.falsePositiveRate(l))
}
// F-measure by label
labels.foreach { l =>
println(s"F1-Score($l) = " + metrics.fMeasure(l))
}
// Weighted stats
println(s"Weighted precision: ${metrics.weightedPrecision}")
println(s"Weighted recall: ${metrics.weightedRecall}")
println(s"Weighted F1 score: ${metrics.weightedFMeasure}")
println(s"Weighted false positive rate: ${metrics.weightedFalsePositiveRate}")