Monday, December 26, 2016

Spark and R K-Means classification

***
The Spark samples use big files that contain thousands of lines.
You do not know the data, so you cannot play with it.
Here I use the simplest possible data set for Spark MLlib, so that one can play with it and understand which metrics
are affected by which parameters.
It is not for seniors, but it is perfect for beginners who need to calibrate parameters with simple sets.
***
The code below is from the sample in the Spark documentation. I changed the RDD so that one can play with it and understand
how the data is distributed.




Here you can play with the values and observe how the points are distributed across clusters.
Always print the cluster centers. They will give you a clue when you move to large datasets.

You can easily change the dataset and the requested number of clusters to get an idea of how
K-means works.
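
To get a feel for what K-means does with such a tiny set, here is a minimal plain-Scala sketch of the classic "assign each point to its nearest center, then recompute the centers" loop. It is only an illustration, not what Spark runs internally: the names `LloydSketch`, `lloyd`, `nearest` and `mean` are made up for this post, and the starting centers (the first and the last point) are an arbitrary assumption.

```scala
object LloydSketch {
  type Point = (Double, Double)

  // Squared Euclidean distance between two 2-D points
  def sqDist(a: Point, b: Point): Double = {
    val dx = a._1 - b._1
    val dy = a._2 - b._2
    dx * dx + dy * dy
  }

  // Index of the center closest to p
  def nearest(p: Point, centers: Seq[Point]): Int =
    centers.indices.minBy(i => sqDist(p, centers(i)))

  // Mean of a non-empty group of points
  def mean(ps: Seq[Point]): Point =
    (ps.map(_._1).sum / ps.size, ps.map(_._2).sum / ps.size)

  // Repeat "assign, then recompute centers" for a fixed number of iterations
  def lloyd(points: Seq[Point], init: Seq[Point], iterations: Int): Seq[Point] =
    (1 to iterations).foldLeft(init) { (centers, _) =>
      val groups = points.groupBy(p => nearest(p, centers))
      // A center that attracted no points keeps its old position
      centers.indices.map(i => groups.get(i).map(mean).getOrElse(centers(i)))
    }

  def main(args: Array[String]): Unit = {
    val points = Seq((1.0, 1.0), (40.0, 40.0), (60.0, 60.0), (101.0, 101.1))
    // Assumed starting guess: the first and the last point
    val centers = lloyd(points, Seq(points.head, points.last), 20)
    centers.foreach(println)
    points.foreach(p => println(s"cluster ${nearest(p, centers)} $p"))
  }
}
```

With this particular starting guess the loop groups the two small points together and the two large points together, which is not the same split as the Spark result reported in this post. K-means only finds a local optimum, so the outcome depends on the starting centers; Spark's `KMeans` picks its own initial centers (k-means|| by default).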


import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors

val parsedData = sc.parallelize(Seq(
  ( Vectors.dense(1.0, 1.0)),
  ( Vectors.dense(40.0, 40.0)),
  ( Vectors.dense(60.0, 60.0)),
  ( Vectors.dense(101.0, 101.1))
))

// Cluster the data into two classes using KMeans
val numClusters = 2
val numIterations = 20
val clusters = KMeans.train(parsedData, numClusters, numIterations)

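// Evaluate clustering by computing Within Set Sum of Squared Errors
val WSSSE = clusters.computeCost(parsedData)
println("Within Set Sum of Squared Errors = " + WSSSE)
val clusterCenters = clusters.clusterCenters.map(_.toArray)
// Format the centers explicitly; concatenating the array itself prints only an object reference
println("The Cluster Centers are = " + clusterCenters.map(_.mkString("[", ", ", "]")).mkString(", "))
// Print the cluster assigned to each point
parsedData.collect().foreach(s => println("cluster " + clusters.predict(s) + " " + s.toString))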

Result
Within Set Sum of Squared Errors = 3874.8066666666673
clusterCenters: Array[Array[Double]] = Array(Array(67.0, 67.03333333333333), Array(1.0, 1.0))
cluster 1 [1.0,1.0]
cluster 0 [40.0,40.0]
cluster 0 [60.0,60.0]
cluster 0 [101.0,101.1]
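
The WSSSE value can be checked by hand, which helps to see what the metric measures: for every point, take the squared distance to its nearest cluster center, then add everything up. The sketch below does this in plain Scala without Spark. The object and function names (`WssseCheck`, `sqDist`, `wssse`) are made up for this post, and the centers are copied from the result above.

```scala
object WssseCheck {
  // Squared Euclidean distance between two points given as arrays
  def sqDist(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Sum, over all points, of the squared distance to the nearest center
  def wssse(points: Seq[Array[Double]], centers: Seq[Array[Double]]): Double =
    points.map(p => centers.map(c => sqDist(p, c)).min).sum

  def main(args: Array[String]): Unit = {
    val points = Seq(
      Array(1.0, 1.0), Array(40.0, 40.0), Array(60.0, 60.0), Array(101.0, 101.1))
    // Centers copied from the Spark result above
    val centers = Seq(Array(67.0, 67.03333333333333), Array(1.0, 1.0))
    println("WSSSE by hand = " + wssse(points, centers))
  }
}
```

The printed value should agree with the figure Spark reported, up to floating-point rounding. The point [1.0,1.0] sits exactly on its center and contributes nothing, so the whole error comes from the three points sharing the other cluster.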

A Similar Example In R

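# Four points forming two well-separated pairs
pointx <- c(1, 2, 50, 51)
pointy <- c(1, 2, 50, 51)
df <- data.frame(pointx, pointy)
library(ggplot2)
# Plot the raw points
ggplot(df, aes(pointx, pointy)) + geom_point()
# Ask for 3 clusters; nstart = 20 tries 20 random starts and keeps the best
myCluster <- kmeans(df, 3, nstart = 20)
myCluster$centers
# Store the assignment in the data frame so ggplot can color by it
df$clus <- as.factor(myCluster$cluster)
ggplot(df, aes(pointx, pointy, color = clus)) + geom_point()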
