Thursday, December 22, 2016

Spark BinaryClassificationMetrics

After training a LogisticRegression we can check whether the result is good with BinaryClassificationMetrics.
It simply takes an RDD of pairs, two values per record.
One is the score associated with your prediction (the rawPrediction or probability column after a Logistic Regression, for example).
The other is the true label you are trying to predict.
For a good classifier, the area under the ROC curve must be close to 1. (A hedged sketch of how to wire a model's output into the metrics follows below.)
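
As a minimal sketch (my addition, not from the original post) of that wiring: here `predictions` is assumed to be the DataFrame produced by a fitted spark.ml LogisticRegressionModel's transform(), and the column names "probability" and "label" are the spark.ml defaults.

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.sql.Row

// predictions = lrModel.transform(testData)  // assumed to exist already
val scoreAndLabels = predictions
  .select("probability", "label")
  .rdd
  .map { case Row(prob: Vector, label: Double) => (prob(1), label) }  // P(class = 1) as the score

val metrics = new BinaryClassificationMetrics(scoreAndLabels)
println(metrics.areaUnderROC)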

What does this mean?
Suppose you are measuring whether you use a heater according to the weather.
(Of course this is obvious; we are deliberately starting with an obvious case.)

Say at 10 F: do not use
20 F: do not use
...
50 F: use
...
100 F: use

You see that for low scores you do not use it, but for high ones you do.
True and false are perfectly separated, so I expect a perfect ROC.

The ROC curve is a graph showing how much the scores we computed with the Logistic Regression actually tell us about the labels.
For example, we can have 4 data points for a single score. If you check the rows below, they mean we only used the heater once at that score.
So score 10 gives three 0 labels and one 1 label, which makes score 10 much less informative to learn from.
Score intervals should give as much information as possible: a pure interval outputs only one label, so its information gain is high (see the sketch after the rows below).

( 10.0, 0.0),
( 10.0, 0.0),
( 10.0, 0.0),
( 10.0, 1.0),
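
As a quick illustration (my own addition, not in the original post), feeding just these four rows to BinaryClassificationMetrics shows how little a single mixed score level is worth: with only one distinct score, the ROC curve degenerates to the diagonal.

import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

// All four records share score 10.0, with 3 negative labels and 1 positive:
val mixed = sc.parallelize(Seq(
  (10.0, 0.0),
  (10.0, 0.0),
  (10.0, 0.0),
  (10.0, 1.0)
))
val mixedMetrics = new BinaryClassificationMetrics(mixed)
println(mixedMetrics.areaUnderROC)  // 0.5: one mixed score level cannot separate the classes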


import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

// (score, label) pairs: low scores never use the heater, high scores always do
val metricData = sc.parallelize(
  Seq(
    ( 10.0, 0.0),
    ( 20.0, 0.0),
    ( 30.0, 0.0),
    ( 40.0, 0.0),
    ( 50.0, 0.0),
    ( 60.0, 1.0),
    ( 70.0, 1.0),
    ( 80.0, 1.0),
    ( 90.0, 1.0),
    (100.0, 1.0)
  )
)

val metrics = new BinaryClassificationMetrics(metricData) 
println("area under the precision-recall curve: " + metrics.areaUnderPR)
println("area under the receiver operating characteristic (ROC) curve : " + metrics.areaUnderROC)
metrics.roc().collect()



The case above was very good, so the metrics come out as below.

area under the precision-recall curve: 1.0 
area under the receiver operating characteristic (ROC) curve : 0.9999999999999999 
Array[(Double, Double)] = Array((0.0,0.0), (0.0,0.2), (0.0,0.4), (0.0,0.6), (0.0,0.8), (0.0,1.0), (0.2,1.0), (0.4,1.0), (0.6,1.0), (0.8,1.0), (1.0,1.0), (1.0,1.0))







Let's prepare bad data where the distribution is useless.
Suppose you are measuring your iced-tea consumption according to the weather.
Unlike above, you do not have a pattern: you do not drink at 10 F, but you drink at 20 F, and so on.
So this is close to a random distribution, and a random distribution gives 0.5 area under the curve.
That is the 45-degree line. A line like that means that at every score of the event
I get equal information from the true and the false cases.


// (score, label) pairs: labels alternate with the score, so there is no pattern to learn
val metricData = sc.parallelize(
  Seq(
    ( 10.0, 0.0),
    ( 20.0, 1.0),
    ( 30.0, 0.0),
    ( 40.0, 1.0),
    ( 50.0, 0.0),
    ( 60.0, 1.0),
    ( 70.0, 0.0),
    ( 80.0, 1.0),
    ( 90.0, 0.0),
    (100.0, 1.0)
  )
)

val metrics = new BinaryClassificationMetrics(metricData) 
println("area under the precision-recall curve: " + metrics.areaUnderPR)
println("area under the receiver operating characteristic (ROC) curve : " + metrics.areaUnderROC)
metrics.roc().collect()







The case above was bad, so the metrics come out as below.

area under the precision-recall curve: 0.6393650793650794 
area under the receiver operating characteristic (ROC) curve : 0.6000000000000001
Array[(Double, Double)] = Array((0.0,0.0), (0.0,0.2), (0.2,0.2), (0.2,0.4), (0.4,0.4), (0.4,0.6), (0.6,0.6), (0.6,0.8), (0.8,0.8), (0.8,1.0), (1.0,1.0), (1.0,1.0))
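
As a small follow-up sketch (again my own addition): beyond the two areas, BinaryClassificationMetrics also exposes per-threshold views, which make it easy to see that no threshold separates this alternating data well.

metrics.precisionByThreshold().collect()  // (threshold, precision) pairs
metrics.recallByThreshold().collect()     // (threshold, recall) pairs
metrics.pr().collect()                    // (recall, precision) points of the PR curve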











