Most Spark samples use big files that contain thousands of lines.
You also do not know the data, so you cannot experiment with it.
Here is the simplest possible data set for Spark MLlib, so that you can experiment and see which metrics
are affected by which parameters.
It is not aimed at senior developers, but it suits beginners who need to tune parameters on simple data sets.
***
The code below is taken from the samples in the Spark documentation. I changed the RDD so that you can
experiment and see how the data is distributed across clusters.

Here you can change the values and observe how the points are distributed among the clusters.
Always print the cluster centers; they give you a clue about what to expect on large data sets.
You can easily change the data set and the requested number of clusters to get an idea of how
K-means works.
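To see what happens behind those cluster numbers, here is a minimal plain-Scala sketch of the basic K-means loop (assign each point to its nearest center, then move each center to the mean of its points). It is only an illustration and not Spark's implementation: `simpleKMeans`, `nearest` and `sqDist` are hypothetical helper names, and seeding with the first k points is my own assumption to keep the run repeatable (Spark picks its starting centers differently).

```scala
// Plain-Scala illustration of the K-means loop; NOT Spark's implementation.
type Point = Vector[Double]

// Squared Euclidean distance between two points of equal dimension.
def sqDist(a: Point, b: Point): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

// Index of the center closest to p (ties go to the lowest index).
def nearest(p: Point, centers: Seq[Point]): Int =
  centers.indices.minBy(i => sqDist(p, centers(i)))

// Hypothetical helper: seeds with the first k points (an assumption made
// for repeatability), then alternates assignment and mean-update steps.
def simpleKMeans(points: Seq[Point], k: Int, iterations: Int): Seq[Point] = {
  var centers: Seq[Point] = points.take(k)
  for (_ <- 1 to iterations) {
    // Assignment step: group the points by their nearest center.
    val groups = points.groupBy(p => nearest(p, centers))
    // Update step: each center moves to the mean of its members.
    centers = centers.indices.map { i =>
      groups.get(i) match {
        case Some(members) => members.transpose.map(_.sum / members.size).toVector
        case None          => centers(i) // a center with no points stays put
      }
    }
  }
  centers
}

// Same four points as in the Spark example.
val points: Seq[Point] = Seq(
  Vector(1.0, 1.0), Vector(40.0, 40.0), Vector(60.0, 60.0), Vector(101.0, 101.1))

// Try every possible number of clusters and print centers and assignments.
for (k <- 1 to points.size) {
  val centers = simpleKMeans(points, k, 20)
  println("k = " + k)
  centers.foreach(c => println("  center " + c.mkString("[", ", ", "]")))
  points.foreach(p =>
    println("  cluster " + nearest(p, centers) + " " + p.mkString("[", ", ", "]")))
}
```

You can paste this into a Scala REPL or spark-shell, then change the points or the range of k and watch how the centers move.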
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors

// Four 2-D points; change these values to experiment
val parsedData = sc.parallelize(Seq(
  Vectors.dense(1.0, 1.0),
  Vectors.dense(40.0, 40.0),
  Vectors.dense(60.0, 60.0),
  Vectors.dense(101.0, 101.1)
))

// Cluster the data into two classes using KMeans
val numClusters = 2
val numIterations = 20
val clusters = KMeans.train(parsedData, numClusters, numIterations)

// Evaluate clustering by computing Within Set Sum of Squared Errors
val WSSSE = clusters.computeCost(parsedData)
println("Within Set Sum of Squared Errors = " + WSSSE)

// Arrays print as object references, so format the centers explicitly
val clusterCenters = clusters.clusterCenters.map(_.toArray)
println("The Cluster Centers are = " + clusterCenters.map(_.mkString("[", ", ", "]")).mkString(", "))

// Print the predicted cluster for each point (foreach, since println is a side effect)
parsedData.collect().foreach(s => println("cluster " + clusters.predict(s) + " " + s.toString()))
Result
Within Set Sum of Squared Errors = 3874.8066666666673
clusterCenters: Array[Array[Double]] = Array(Array(67.0, 67.03333333333333), Array(1.0, 1.0))
cluster 1 [1.0,1.0]
cluster 0 [40.0,40.0]
cluster 0 [60.0,60.0]
cluster 0 [101.0,101.1]
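With a data set this small you can check the reported numbers by hand. The sketch below uses plain Scala (no Spark needed) and copies the cluster memberships from the printed result; `mean` and `sse` are hypothetical helper names of my own. It computes each cluster's center as the mean of its points, then adds up the squared distances of the points to their center, which is how I understand the Within Set Sum of Squared Errors.

```scala
// Cluster memberships copied from the printed result above.
val cluster0 = Seq(Vector(40.0, 40.0), Vector(60.0, 60.0), Vector(101.0, 101.1))
val cluster1 = Seq(Vector(1.0, 1.0))

// Coordinate-wise mean of a group of points, i.e. its cluster center.
def mean(ps: Seq[Vector[Double]]): Vector[Double] =
  ps.transpose.map(_.sum / ps.size).toVector

// Sum of squared distances from each point to the center of its group.
def sse(ps: Seq[Vector[Double]]): Double = {
  val c = mean(ps)
  ps.map(p => p.zip(c).map { case (x, m) => (x - m) * (x - m) }.sum).sum
}

// The total cost is the sum over all clusters.
val wssse = sse(cluster0) + sse(cluster1)

println("center of cluster 0 = " + mean(cluster0).mkString("[", ", ", "]"))
println("center of cluster 1 = " + mean(cluster1).mkString("[", ", ", "]"))
println("hand-computed WSSSE = " + wssse)
```

The single point in cluster 1 contributes nothing to the cost, so the whole value comes from how far the three points of cluster 0 sit from their shared center.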
Similar Code In R (with different points and three clusters)
pointx = c(1, 2, 50, 51)
pointy = c(1, 2, 50, 51)
df = data.frame(pointx, pointy)

library(ggplot2)
# Plot the raw points
ggplot(df, aes(pointx, pointy)) + geom_point()

# Cluster into three groups, trying 20 random starts
myCluster <- kmeans(df, 3, nstart = 20)
myCluster$centers

# Color each point by its assigned cluster
myCluster$clus <- as.factor(myCluster$cluster)
ggplot(df, aes(pointx, pointy, color = myCluster$clus)) + geom_point()