Monday, December 26, 2016

Spark and R K-Means Clustering

***
The official Spark samples target big files containing thousands of lines.
You also do not know the data and cannot play with it.
Here I use the simplest possible data set for Spark MLlib, so that you can play with it and understand
which parameters affect which metrics.
It is not for seniors, but it is perfect for beginners who need to calibrate parameters with simple sets.
***
The code below is based on the Spark documentation sample. I changed the RDD so that you can play with it and understand
how the data is distributed.




Here you can play with the values and observe how the clusters are distributed.
Always print the cluster centers; they give you a clue about what is happening in large datasets.

You can easily vary the dataset and the requested number of clusters to get an idea of how
K-means works.


import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors

val parsedData = sc.parallelize(Seq(
  Vectors.dense(1.0, 1.0),
  Vectors.dense(40.0, 40.0),
  Vectors.dense(60.0, 60.0),
  Vectors.dense(101.0, 101.1)
))

// Cluster the data into two classes using KMeans
val numClusters = 2
val numIterations = 20
val clusters = KMeans.train(parsedData, numClusters, numIterations)

// Evaluate clustering by computing Within Set Sum of Squared Errors
val WSSSE = clusters.computeCost(parsedData)
println("Within Set Sum of Squared Errors = " + WSSSE )
val clusterCenters = clusters.clusterCenters.map(_.toArray)
println("The cluster centers are: " + clusterCenters.map(_.mkString("[", ",", "]")).mkString(", "))
parsedData.collect().foreach(s => println("cluster " + clusters.predict(s) + " " + s.toString()))

Result
Within Set Sum of Squared Errors = 3874.8066666666673
clusterCenters: Array[Array[Double]] = Array(Array(67.0, 67.03333333333333), Array(1.0, 1.0))
cluster 1 [1.0,1.0]
cluster 0 [40.0,40.0]
cluster 0 [60.0,60.0]
cluster 0 [101.0,101.1]
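To calibrate the number of clusters you can also train the model for several values of k and compare the WSSSE values (the classic elbow check). A minimal sketch, reusing parsedData and numIterations from the code above:

// Elbow check: train KMeans for several k values and watch how
// the Within Set Sum of Squared Errors drops.
(1 to 4).foreach { k =>
  val model = KMeans.train(parsedData, k, numIterations)
  println(s"k = $k  WSSSE = ${model.computeCost(parsedData)}")
}

The cost always drops as k grows; you look for the point where it stops dropping sharply.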

Same Code In R

pointx = c(1,2, 50, 51) 
pointy = c(1,2,50,51) 
df = data.frame(pointx, pointy)
library(ggplot2)
ggplot(df, aes(pointx, pointy)) + geom_point()
myCluster <- kmeans(df, 3, nstart = 20)
myCluster$centers
myCluster$clus <- as.factor(myCluster$cluster)
ggplot(df, aes(pointx, pointy, color = myCluster$clus)) + geom_point()

Friday, December 23, 2016

Spark NaiveBayes and Result Interpretation

***
The official Spark samples target big files containing thousands of lines.
You also do not know the data and cannot play with it.
Here I use the simplest possible data set for Spark MLlib, so that you can play with it and understand
which parameters affect which metrics.
It is not for seniors, but it is perfect for beginners who need to calibrate parameters with simple sets.
***
In the samples on the internet people usually try to guess whether a mail is spam or not.
The code below combines pieces from the Spark samples and some other examples.
I tried to make it work with Spark 2 without success; it works with 1.6.
Since it did not work at first I played around a lot and took many fixes from the net, so the code is not neat.

Let's make it much simpler. I will list some properties of an object and try to guess whether it is
a plane or not.
My training set is:
"wing wheel engine" : 1, it is a plane
"wheel airbag engine" : 0, it is not a plane

Steps
1) Get the training set
2) Tokenize it
3) Apply HashingTF

The result of HashingTF: it generates two term-frequency vectors from the words.

0 wheel airbag engine ["wheel","airbag","engine"] {"type":0,"size":20,"indices":[3,14,18],"values":[1,1,1]}
1 wing wheel engine ["wing","wheel","engine"] {"type":0,"size":20,"indices":[3,7,14],"values":[1,1,1]}


4) Train the model
Result of training:
Array[org.apache.spark.mllib.regression.LabeledPoint] = Array(
(8.0,[0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0]), 
(9.0,[0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0])
)

5) Prepare the test data
(0,"wing airbag")
(1,"wing airport")
(0,"wing airport") a false positive: this will be guessed as a plane, but it is a zeppelin!!
HashingTF generates the vectors below for the test data:

[8,wing airbag,WrappedArray(wing, airbag),(20,[7,18],[1.0,1.0])], 
[9,wing airport,WrappedArray(wing, airport),(20,[3,7],[1.0,1.0])])

In the model vectors, index 7 was "wing" and index 3 was "wheel".

6) Apply the prediction; testpredictionAndLabel holds (prediction, label) pairs:
(0.0,0.0) predicted not a plane, actually not a plane
(1.0,1.0) predicted a plane, actually a plane
(1.0,0.0) predicted a plane, actually not a plane

7) Dump the metrics; the output is as below.

Confusion matrix: 
1.0 1.0 
0.0 1.0 
Precision(0.0) = 1.0 
Precision(1.0) = 0.5 
Recall(0.0) = 0.5 
Recall(1.0) = 1.0 
FPR(0.0) = 0.0 
FPR(1.0) = 0.5 
F1-Score(0.0) = 0.6666666666666666 
F1-Score(1.0) = 0.6666666666666666 
Weighted precision: 0.8333333333333333 
Weighted recall: 0.6666666666666666 
Weighted F1 score: 0.6666666666666666 
Weighted false positive rate: 0.16666666666666666 
labels: Array[Double] = Array(0.0, 1.0)

Precision(0.0) = 1.0 : we predicted 1 zero (not plane) and it was correct, so the ratio is 1 / 1 = 1
Precision(1.0) = 0.5 : we predicted 2 ones (plane), 1 was correct and 1 was not, so the ratio is 1 / 2 = 0.5

Recall(0.0) = 0.5 : we found 1 of the zeros, but there were in fact 2 zeros, so 1 / 2 = 0.5
Recall(1.0) = 1.0 : we found 1 one, and there was in fact only 1 one, so 1 / 1 = 1

F1-Score(0.0) = 0.6666666666666666

F1-Score = 2 x ( precision x recall ) / ( precision + recall )
= 2 x ( 0.5 x 1 ) / ( 0.5 + 1 ) = 1 / 1.5 = 0.67

From definitions :
Precision can be seen as a measure of exactness or quality, whereas recall is a measure of completeness or quantity.
High precision means that an algorithm returned substantially more relevant results than irrelevant ones, while high recall means that an algorithm returned most of the relevant results.

What do these mean?
Suppose that in our example we had a bigger set and our model said
there are 30 planes, but only 20 of those 30 really are planes.
Then precision is 20 / 30 ≈ 0.67 : this is how well we performed on the results we returned.
But there are also items we missed.
Suppose there were in fact 50 planes in total.
Then recall = 20 / 50 = 0.4.
It is the percentage of the real results that we returned.
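The same arithmetic as a tiny Scala sketch; the counts are the hypothetical 30 / 20 / 50 numbers from the example above, not from the Spark output:

// Hypothetical counts: we flagged 30 planes, 20 of them really are planes,
// and there are 50 planes in total.
val flagged = 30.0
val correctlyFlagged = 20.0
val actualPlanes = 50.0

val precision = correctlyFlagged / flagged        // 20 / 30 ≈ 0.67
val recall = correctlyFlagged / actualPlanes      // 20 / 50 = 0.4
val f1 = 2 * precision * recall / (precision + recall)

println(f"precision = $precision%.2f  recall = $recall%.2f  F1 = $f1%.2f")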


High precision, low recall: we are very good at what we do return, but we do not cover the whole space.
It means we chose the cut-off value too high.





import org.apache.spark.ml.feature.{RegexTokenizer, Tokenizer}
import org.apache.spark.ml.feature.{HashingTF, IDF}
import org.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel}
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.sql.Row



// Build the training DataFrame and turn the text into term-frequency vectors
val trainData = sqlContext.createDataFrame(Seq((0, "wheel airbag engine"), (1, "wing wheel engine"))).toDF("category", "text")
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val wordsData = tokenizer.transform(trainData)
val hashTF = new HashingTF().setInputCol("words").setOutputCol("features").setNumFeatures(20)
val featureData = hashTF.transform(wordsData)

// Keep only the label and the features, casting the label to Double
val subFeature = featureData.select("category", "features")
val df_1 = subFeature.withColumnRenamed("category", "category2")
val trainDataRdd2 = df_1.withColumn("category", df_1.col("category2").cast("double")).drop("category2")

trainDataRdd2.printSchema()

// Convert the rows into LabeledPoints for MLlib
val trainLabeledPoints = trainDataRdd2.select("category", "features").map { case Row(l: Double, p: Vector) => LabeledPoint(l, p) }

// Train the Naive Bayes model
val model = NaiveBayes.train(trainLabeledPoints, lambda = 1.0, modelType = "multinomial")

// Do the same transformations for the test data
val testData = sqlContext.createDataFrame(Seq((0, "wing airbag"), (1, "wing airport"), (0, "wing airport"))).toDF("category", "text")
val testWordData = tokenizer.transform(testData)
val testFeatureData = hashTF.transform(testWordData)
val testDataRdd = testFeatureData.select("category", "features").map {
  case Row(label: Int, features: Vector) =>
    LabeledPoint(label.toDouble, Vectors.dense(features.toArray))
}

// Pair each prediction with the true label
val testpredictionAndLabel = testDataRdd.map(p => (model.predict(p.features), p.label))


val metrics = new MulticlassMetrics(testpredictionAndLabel)
/* output F1-measure for all labels (0 and 1, negative and positive) */
metrics.labels.foreach( l => println(metrics.fMeasure(l)))
testpredictionAndLabel.take(5)
// Confusion matrix
println("Confusion matrix:")
println(metrics.confusionMatrix)


// Precision by label
val labels = metrics.labels
labels.foreach { l =>
  println(s"Precision($l) = " + metrics.precision(l))
}

// Recall by label
labels.foreach { l =>
  println(s"Recall($l) = " + metrics.recall(l))
}

// False positive rate by label
labels.foreach { l =>
  println(s"FPR($l) = " + metrics.falsePositiveRate(l))
}

// F-measure by label
labels.foreach { l =>
  println(s"F1-Score($l) = " + metrics.fMeasure(l))
}

// Weighted stats
println(s"Weighted precision: ${metrics.weightedPrecision}")
println(s"Weighted recall: ${metrics.weightedRecall}")
println(s"Weighted F1 score: ${metrics.weightedFMeasure}")
println(s"Weighted false positive rate: ${metrics.weightedFalsePositiveRate}")
  

Thursday, December 22, 2016

Spark BinaryClassificationMetrics

After fitting a logistic regression we can check whether the result is good with BinaryClassificationMetrics.
It simply takes pairs of two values.
One is the score associated with your prediction (for example the rawPrediction column after a logistic regression).
The other is the true label, i.e. what actually happened.
For a good model the area under the ROC curve must be close to 1.

What does this mean?
Suppose you are measuring whether you use the heater depending on the weather.
(Of course this is obvious; we are deliberately using an obvious case.)

Say at 10 F : do not use
20 F : do not use
...
50 F : use
..
100 F : use

You see that for low scores you do not use it, but for high ones you do.
True and false are perfectly separated, so I expect a perfect ROC.

The ROC is a graph showing how much information the data gives to the calculations we did with the logistic regression.
For example, we can have 4 observations for one point. Looking at the rows below, it means we only used the heater once in this period,
so the value 10 gives three 0 labels and one 1 label. This makes learning from the value 10 less efficient.
Intervals must give as much information as possible.
A purified interval outputs only one label, so the information gain is high.

( 10.0, 0.0),
( 10.0, 0.0),
( 10.0, 0.0),
( 10.0, 1.0),


import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

val metricData = sc.parallelize(
  
   Seq( 
     ( 10.0,  0.0),
     ( 20.0,  0.0),
     ( 30.0,  0.0),
     ( 40.0,  0.0),
     ( 50.0,  0.0),
     ( 60.0,  1.0),
     ( 70.0,  1.0),
     ( 80.0,  1.0),
     ( 90.0,  1.0),     
     ( 100.0,  1.0)
    
     )
);

val metrics = new BinaryClassificationMetrics(metricData) 
println("area under the precision-recall curve: " + metrics.areaUnderPR)
println("area under the receiver operating characteristic (ROC) curve : " + metrics.areaUnderROC)
metrics.roc().collect()



The case above was very good, so the metrics are as below.

area under the precision-recall curve: 1.0 
area under the receiver operating characteristic (ROC) curve : 0.9999999999999999 
Array[(Double, Double)] = Array((0.0,0.0), (0.0,0.2), (0.0,0.4), (0.0,0.6), (0.0,0.8), (0.0,1.0), (0.2,1.0), (0.4,1.0), (0.6,1.0), (0.8,1.0), (1.0,1.0), (1.0,1.0))
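Besides the two areas, BinaryClassificationMetrics can also list precision and recall per score threshold, which shows where the separation actually happens. A short sketch that continues from the metrics object above:

// Each pair is (threshold, value): the precision or recall you get when
// everything with a score at or above that threshold is predicted positive.
metrics.precisionByThreshold().collect().foreach { case (t, p) =>
  println(s"threshold = $t  precision = $p")
}
metrics.recallByThreshold().collect().foreach { case (t, r) =>
  println(s"threshold = $t  recall = $r")
}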







Let's prepare bad data where the distribution is useless.
Suppose you are measuring your ice-tea consumption depending on the weather.
Unlike above, you do not have a pattern: you do not drink at 10 F but you do drink at 20 F, and so on.
So this is close to a random distribution, and a random distribution gives an area of 0.5 under the curve.
That is the 45-degree line. A line like that means that at every probability (score) of the event
I get equal information from the true and the false cases.

val metricData= sc.parallelize(
  
   Seq( 
     ( 10.0,  0.0),
     ( 20.0,  1.0),
     ( 30.0,  0.0),
     ( 40.0,  1.0),
     ( 50.0,  0.0),
     ( 60.0,  1.0),
     ( 70.0,  0.0),
     ( 80.0,  1.0),
     ( 90.0,  0.0),     
     ( 100.0,  1.0)
    
     )
);

val metrics = new BinaryClassificationMetrics(metricData) 
println("area under the precision-recall curve: " + metrics.areaUnderPR)
println("area under the receiver operating characteristic (ROC) curve : " + metrics.areaUnderROC)
metrics.roc().collect()







The case above was bad, so the metrics are as below.

area under the precision-recall curve: 0.6393650793650794 
area under the receiver operating characteristic (ROC) curve : 0.6000000000000001 metrics: 
Array[(Double, Double)] = Array((0.0,0.0), (0.0,0.2), (0.2,0.2), (0.2,0.4), (0.4,0.4), (0.4,0.6), (0.6,0.6), (0.6,0.8), (0.8,0.8), (0.8,1.0), (1.0,1.0), (1.0,1.0))












ChiSqSelector for Top Feature Selection

When I first get data I try to understand the most important columns and the relations between columns.
The relation between columns is the correlation coefficient matrix, for which I will give a sample in another post.

Suppose you are guessing whether someone is a woman or not based on
foot, chest, height and hair measurements.
You do not know which parameter affects the result most.
You run a ChiSqSelector test to find the most important n parameters;
setNumTopFeatures controls the number n.

import org.apache.spark.ml.feature.ChiSqSelector
import org.apache.spark.ml.linalg.Vectors

val data = Seq(
  //            foot,chest,height,hair  women
  (7, Vectors.dense(36, 90.0, 165.0, 20.0), 1.0),
  (8, Vectors.dense(38, 95.0, 170.0, 25.0), 1.0),
  (8, Vectors.dense(41, 60.0, 178.0, 10.0), 0.0),
  (9, Vectors.dense(42.0, 60.0, 165.0, 5.1), 0.0)
)

val df = spark.createDataset(data).toDF("id", "features", "clicked")

val selector = new ChiSqSelector()
  .setNumTopFeatures(2)
  .setFeaturesCol("features")
  .setLabelCol("clicked")
  .setOutputCol("selectedFeatures")

val result = selector.fit(df).transform(df)
result.show()

The result is as below.
You can see the selectedFeatures column.
Since setNumTopFeatures is 2, there are 2 features left in each selectedFeatures vector.
According to ChiSqSelector, the most important column is foot, followed by chest.

| id| features|clicked|selectedFeatures|
| 7|[36.0,90.0,165.0,...| 1.0| [36.0,90.0]|
| 8|[38.0,95.0,170.0,...| 1.0| [38.0,95.0]|
| 8|[41.0,60.0,178.0,...| 0.0| [41.0,60.0]|
| 9|[42.0,60.0,165.0,...| 0.0| [42.0,60.0]|
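If you only want the indices of the winning columns instead of the transformed rows, the fitted model exposes them directly. A small sketch; the names array is only added here for readability:

// Which feature indices survived the chi-square test?
val names = Array("foot", "chest", "height", "hair")
val model = selector.fit(df)
model.selectedFeatures.foreach(i => println(s"selected feature $i: ${names(i)}"))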

Sunday, December 18, 2016

Sample Size and Standard Error




Let's say you record your sleeping time for 1 year.
You find the average to be 10 hours (this is the average over all individual days).

You also find the population standard deviation to be 2.

This is the leftmost part of the picture.

Now you randomly take 10 days within the year and compute their average.
Repeat this process many times to get the averages of different "10-day samples".
*** Be careful: you have now taken the average of only a sample subset.

This is the middle part of the picture.

The standard error is 2 / √10
Standard error ≈ 0.63
This value means that with 10 samples, the sample means are typically about 0.63 away from the real mean, which is 10.

You can see this in the sample graph.


If we take 100 samples, the sample means are more tightly centered around the real mean.
The standard error is 2 / √100
Standard error = 0.2


The more samples you take, the closer you get to the population mean.

Standard error = standard deviation / √(sample size)

So if the standard deviation is high (your data varies a lot), you will have a big error for your samples.
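You can verify the formula with a small simulation. The sketch below assumes a normally distributed population with mean 10 and standard deviation 2, draws many samples of size n, and measures how much the sample means spread; the result should come out close to 2 / √n:

import scala.util.Random

val rng = new Random(42)
val (populationMean, populationSd) = (10.0, 2.0)

// Draw `trials` samples of size n, average each one, and measure how much
// those sample means spread around their own mean.
def observedStandardError(n: Int, trials: Int = 100000): Double = {
  val sampleMeans = Array.fill(trials) {
    Array.fill(n)(populationMean + populationSd * rng.nextGaussian()).sum / n
  }
  val grand = sampleMeans.sum / trials
  math.sqrt(sampleMeans.map(m => (m - grand) * (m - grand)).sum / trials)
}

Seq(10, 100).foreach { n =>
  println(f"n = $n%3d  observed SE = ${observedStandardError(n)}%.3f  theoretical SE = ${populationSd / math.sqrt(n)}%.3f")
}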

Saturday, December 17, 2016

Spark Apply Descriptive Statistics on DataFrame

When you first get your data you have to play with it.
You want to learn what kind of data you have.
Below is a simple piece of code to begin investigating the general properties of your data.

Suppose you have data like 48, 49, 50, 51, 52. This is well-distributed, homogeneous data.

import org.apache.commons.math3.stat.descriptive._

val df = Seq(48.0, 49.0, 50.0, 51.0, 52.0).toDF("nums")

// Pull the column back to the driver as a plain Array[Double]
val nums = df.select("nums").rdd.map(row => row(0).asInstanceOf[Double]).collect()

val arrMean = new DescriptiveStatistics()
nums.foreach(v => arrMean.addValue(v))

// toString prints n, min, max, mean, std dev, median, skewness and kurtosis
println(arrMean)

val meanQ1 = arrMean.getPercentile(25)
val meanQ3 = arrMean.getPercentile(75)
val meanIQR = meanQ3 - meanQ1





Perfect distribution
48, 49, 50, 51, 52
n: 5 
min: 48.0 max: 52.0 mean: 50.0 
std dev: 1.5811388300841898 
median: 50.0 
skewness: 0.0 
kurtosis: -1.200000000000002 
meanQ1: Double = 48.5 
meanQ3: Double = 51.5 
meanIQR: Double = 3.0



Let's form a flat, line-shaped distribution where every value is the same.
val df = Seq(50,50, 50, 50,50.0).toDF("nums")

n: 5 
min: 50.0 
max: 50.0 
mean: 50.0 
std dev: 0.0 
median: 50.0 
skewness: NaN 
kurtosis: NaN 
meanQ1: Double = 50.0 
meanQ3: Double = 50.0 
meanIQR: Double = 0.0

Let's add 40 to make a left skew (negative skew).
** Skewness is the asymmetry of the distribution about its mean.



Left tail (skew ) distribution

val df = Seq(40,48,49, 50, 51,52.0).toDF("nums")
n: 6 
min: 40.0 
max: 52.0 
mean: 48.333333333333336 
std dev: 4.320493798938574 
median: 49.5 
skewness: -1.8805720776629977 
kurtosis: 3.9187500000000064 
meanQ1: Double = 46.0 
meanQ3: Double = 51.25 
meanIQR: Double = 5.25


Right tail (skew) distribution
If we just add 60 to the original series we get a right-tail distribution.
The skewness is the same with the opposite sign.
val df = Seq(48,49, 50, 51,52.0,60).toDF("nums")

n: 6 
min: 48.0 
max: 60.0 
mean: 51.666666666666664 
std dev: 4.320493798938574 
median: 50.5 
skewness: 1.8805720776629975 
kurtosis: 3.9187500000000064

meanQ1: Double = 48.75 
meanQ3: Double = 54.0 
meanIQR: Double = 5.25

The IQR describes the data without its extreme boundary values, so it tells you a lot if you know your domain.
For example, say you have car price data and you know the car should cost around $50,000.
When you check the IQR range you will see the values near your expectation. The values outside it will
either be meaninglessly high (an unrealistic expectation of the seller) or low (this time meaningful, because
the car could be damaged). The IQR is a nice measure.
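A quick sketch of using the IQR to flag suspicious prices with the same DescriptiveStatistics class as above; the price list is made up for illustration:

import org.apache.commons.math3.stat.descriptive.DescriptiveStatistics

// Made-up car prices: most around 50,000, plus one dreamer and one wreck.
val prices = Seq(48000.0, 49500.0, 50000.0, 51000.0, 52000.0, 95000.0, 12000.0)

val stats = new DescriptiveStatistics()
prices.foreach(stats.addValue)

val q1 = stats.getPercentile(25)
val q3 = stats.getPercentile(75)
val iqr = q3 - q1

// The classic 1.5 * IQR fences: anything outside them deserves a second look.
val (low, high) = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
prices.filter(p => p < low || p > high).foreach(p => println(s"suspicious price: $p"))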

Skewness gives a rough idea about the tendency of the data (a negative value means the data has a tail to the left).

Kurtosis is a measure of shape: the sharper the peak, the higher the kurtosis. Please check a picture on the internet.













Monday, November 7, 2016

SAP HANA clone user

SAP HANA has a copy-user function.
That function only clones the roles granted by _SYS_REPO.
It also does not copy any analytical privileges.

The piece of SQL below creates a procedure to clone these objects.

The code checks whether the user exists and creates it if not.
It only selects roles with GRANTOR "_SYS_REPO" (this is my business requirement; you can change it the way you want).
It only selects privileges of type SQLANALYTICALPRIVILEGE and ANALYTICALPRIVILEGE.

It uses the GRANT_ACTIVATED_ROLE and GRANT_ACTIVATED_ANALYTICAL_PRIVILEGE procedures to grant these objects.

CREATE PROCEDURE clone_user (IN V_SOURCE_USER NVARCHAR(256), IN V_TARGET_USER NVARCHAR(256), IN V_TARGET_PWD NVARCHAR(256)) 
LANGUAGE SQLSCRIPT 
AS 
V_FOUND INT := 1;

CURSOR C_ROLE_LIST (V_SRC_USER NVARCHAR(256)) FOR
SELECT ROLE_NAME, IS_GRANTABLE FROM "SYS"."GRANTED_ROLES" WHERE GRANTEE=:V_SRC_USER AND GRANTEE_TYPE='USER' and GRANTOR = '_SYS_REPO' and  ROLE_NAME <> 'PUBLIC' ;

CURSOR C_PRIV_LIST (V_SCR_USER NVARCHAR(256)) FOR
SELECT OBJECT_NAME, PRIVILEGE, IS_GRANTABLE FROM "SYS"."GRANTED_PRIVILEGES" WHERE GRANTEE=:V_SCR_USER AND  (OBJECT_TYPE = 'SQLANALYTICALPRIVILEGE' or OBJECT_TYPE = 'ANALYTICALPRIVILEGE' );  
BEGIN

SELECT COUNT(*) INTO V_FOUND FROM "SYS"."USERS" WHERE USER_NAME = :V_TARGET_USER;
IF :V_FOUND = 0 THEN
EXEC 'CREATE USER ' || :V_TARGET_USER || ' PASSWORD ' || :V_TARGET_PWD;
END IF;

FOR V_LIST_ROW AS C_ROLE_LIST(:V_SOURCE_USER) DO
EXEC 'CALL GRANT_ACTIVATED_ROLE( ''' || V_LIST_ROW.ROLE_NAME || '''  , ''' ||  V_TARGET_USER ||  ''' ) ' ;
END FOR;

FOR V_LIST_ROW AS C_PRIV_LIST(:V_SOURCE_USER) DO
EXEC 'CALL GRANT_ACTIVATED_ANALYTICAL_PRIVILEGE( ''"' || V_LIST_ROW.OBJECT_NAME || '"''  , ''' ||  V_TARGET_USER ||  ''' ) ' ;    
END FOR;

END;

Tuesday, October 18, 2016

Zookeeper not starting

When you try to start ZooKeeper for the first time, it may fail to start without giving any error message.
In that case, start ZooKeeper in the foreground with:

./zkServer.sh start-foreground

This will bring the error into console. In my case error was

Using config: /home/myuser/zookeeper/bin/../conf/zoo.cfg
2016-10-18 12:25:54,423 - INFO  [main:QuorumPeerConfig@90] - Reading configuration from: /home/myuser/zookeeper/bin/../conf/zoo.cfg
2016-10-18 12:25:54,429 - INFO  [main:QuorumPeerConfig@310] - Defaulting to majority quorums
2016-10-18 12:25:54,431 - FATAL [main:QuorumPeerMain@83] - Invalid config, exiting abnormally
org.apache.zookeeper.server.quorum.QuorumPeerConfig$ConfigException: Error processing /home/myuser/zookeeper/bin/../conf/zoo.cfg
at org.apache.zookeeper.server.quorum.QuorumPeerConfig.parse(QuorumPeerConfig.java:110)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:99)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:76)
Caused by: java.lang.IllegalArgumentException: /data/zookeeper/myid file is missing
at org.apache.zookeeper.server.quorum.QuorumPeerConfig.parseProperties(QuorumPeerConfig.java:320)
at org.apache.zookeeper.server.quorum.QuorumPeerConfig.parse(QuorumPeerConfig.java:106)
... 2 more
Invalid config, exiting abnormally


It states "/data/zookeeper/myid file is missing".
Every ZooKeeper instance must announce its identity.

From Apache documentation :

server.x=[hostname]:nnnnn[:nnnnn], etc
(No Java system property)

servers making up the ZooKeeper ensemble. When the server starts up, it determines which server it is by looking for the file myid in the data directory. That file contains the server number, in ASCII, and it should match x in server.x in the left hand side of this setting.


This means we must create a file named myid in the data directory. For a configuration like the one below,

server.1=m2.mydomain.com:2888:3888
server.2=m3.mydomain.com:2888:3888
server.3=m5.mydomain.com:2888:3888

we must create myid on m2.mydomain.com and write just 1 into it.

Sunday, October 2, 2016

Mean Imputation for Missing Data

Sometimes we receive data with lots of null cells to analyze. For example, I have car data with the following schema:

root
|-- product: string (nullable = true)
|-- price: integer (nullable = true)
|-- km: double (nullable = true)
|-- color: string (nullable = true)
|-- fuel: string (nullable = true)
|-- city: string (nullable = true)
|-- gear: string (nullable = true)
|-- year: integer (nullable = true)

Mostly people do not enter the km value. If I remove the rows with a null km I end up with a sparse dataset.
Reluctantly I do mean imputation, which is in fact not a very good thing, because the variance and any regression
on the data will be affected.



training = dataset.withColumn("label", dataset['km']*1.0).na.replace(0,averagekm)
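In Scala the same idea looks roughly like the sketch below; the DataFrame, the column names and the SparkSession setup are illustrative, not from the original data.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

val spark = SparkSession.builder().appName("MeanImpute").master("local[*]").getOrCreate()
import spark.implicits._

// Illustrative data: one row is missing its km value.
val cars = Seq(
  ("carA", Option(100000.0)),
  ("carB", Option.empty[Double]),
  ("carC", Option(50000.0))
).toDF("product", "km")

// Average of the non-null km values ...
val averageKm = cars.select(avg($"km")).first().getDouble(0)

// ... used to fill the missing cells.
val imputed = cars.na.fill(Map("km" -> averageKm))
imputed.show()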

Monday, August 29, 2016

SAP Hana procedure, Could not create catalog object: insufficient privilege on creating a Stored Procedure

"insufficient privilege" is a common error while struggling with SAP Hana.
Easiest way is to configure a trace for your user.

This I was creating a stored procedure and got this error.

[42227]{251259}[51/87698283] 2016-08-29 16:15:48.131475 i Authorization SQLFacade.cpp(01415) : UserId(141949) is not authorized to do SELECT on ObjectId(2,0,oid=11254798)
[42227]{251259}[51/87698283] 2016-08-29 16:15:48.131572 i Authorization SQLFacade.cpp(01961) :



Select * from objects where object_oid = '11254798';
Select * from users where user_id = 141949;

The user was _SYS_REPO and the object was the table type I use for the procedure output, DAILY_REPLICATION.

PROCEDURE "MYUSER"."RFX_Business_Scenario::chk_lastweek_withno_data" ( out OUTPUT_TABLE "MYUSER"."DAILY_REPLICATION" )


I granted SELECT on my schema to _SYS_REPO and created my procedure successfully.

GRANT SELECT ON SCHEMA MYSCHEMA TO _SYS_REPO WITH GRANT OPTION;


In the logs you can see that the user trying to do these jobs is _SYS_REPO:

User _SYS_REPO tried to execute 'CREATE PROCEDURE

SAP HANA stored procedure to find days where no transaction occurred on our system, OPInt Business Situation

Suppose you have a system where people create transactions.
You want to find the days on which people generated no transactions.

Let's assume you want to check last week.

Define a UNION ALL block to manually generate the days of last week.
Query the day differences between each transaction and today.
Find the days that are not in the difference data above.

You can use this as a detection procedure in OPInt.



SELECT mdays.mday
FROM
(
SELECT 1 mday from dummy UNION ALL
SELECT 2 mday from dummy UNION ALL
SELECT 3 mday from dummy UNION ALL
SELECT 4 mday from dummy UNION ALL
SELECT 5 mday from dummy UNION ALL
SELECT 6 mday from dummy UNION ALL
SELECT 7 mday from dummy
) mdays
where mdays.mday not in (

select
distinct DAYS_BETWEEN (to_date("CREATED_AT" ),now())
from
"MY_SCHEMA"."TRANSACTION_TABLE"
where DAYS_BETWEEN (to_date("CREATED_AT" ),now()) < 10
)

Thursday, August 25, 2016

SAP OPInt Business Situation

Suppose you want to have an alarm when no order is created on a particular day.

In OPInt this is achieved by creating a new Business Situation with a detection procedure.

A detection procedure returns data in table format.
For example, you want to find the days on which you created purchase orders in the last 10 days.

PROCEDURE "TABLESCHEMA"."RFX_Business_Scenario::chk_replication" ( out OUTPUT_TABLE "YOURSCHEMA"."DAILYCOUNT" )
LANGUAGE SQLSCRIPT
SQL SECURITY INVOKER

READS SQL DATA AS
BEGIN
var1 = SELECT
top 20
LPAD(CAST(LPAD(cast("CREATED_AT" as NVARCHAR),14,'0') AS timestamp),10,'0') , count(*)

FROM "SRM_GRPQ_REPL"."CRMD_ORDERADM_H"
where
LPAD(CAST(LPAD(cast("CREATED_AT" as NVARCHAR),14,'0') AS timestamp),10,'0') < LPAD( now(),10,'0')

group by LPAD(CAST(LPAD(cast("CREATED_AT" as NVARCHAR),14,'0') AS timestamp),10,'0')
order BY LPAD(CAST(LPAD(cast("CREATED_AT" as NVARCHAR),14,'0') AS timestamp),10,'0') DESC ;



OUTPUT_TABLE = SELECT count(*) as TOTAL_COUNT FROM :var1 ;
END;

SAP Hana, java.lang.IllegalStateException: java.sql.SQLException: Table type not found

I try to write up the rare errors I get during SAP HANA work; usually I cannot find an entry on the internet about them.

The error below occurred during the activation of a scenario.

sap hana java.lang.IllegalStateException: java.sql.SQLException: Table type not found


I created a stored procedure as below

PROCEDURE "SRM_GRP_REPL"."RFX_Business_Scenario::chk_replication" ( out OUTPUT_TABLE table(TOTAL_COUNT integer) )

Calling this procedure directly is legal, but when I used it in a scenario it generated this error:
Unexpected problem ocurred while validating scenario artifacts
java.lang.IllegalStateException: java.sql.SQLException: Table type not found





OPInt probably has difficulty with an inline table type declaration: ( table(TOTAL_COUNT integer) )

Converting it into a real table type fixed the problem.

PROCEDURE "TARGET_SCHEMA"."RFX_Business_Scenario::chk_replication" ( out OUTPUT_TABLE "MYSCHEMA"."DAILYCOUNT" )

Wednesday, August 24, 2016

User is not authorized to query for ProcessDefinition


Sometimes OPInt problems have no hits on the internet. I write up the problems as I see them to help others
and to remind myself what to do if I encounter the same error again.

Exception

User xyz is not authorized to search for processes in 10.6.xxx.xxx. Contact your administrator for the required authorizations.

com.sap.pv.opm.discovery.bpm.search.BPMSearchOperation$BPMRequestFailedException: com.sap.visibility.bpm.facade.service.ExecutionFault: User is not authorized to query for ProcessDefinition.
at com.sap.pv.opm.discovery.bpm.search.BPMSearchOperation.run(BPMSearchOperation.java:45)
at com.sap.pv.opm.discovery.bpm.search.BPMSearchOperation.run(BPMSearchOperation.java:1)
at com.sap.pv.opm.discovery.common.operation.CancellableLongOperation$WrapperThread.run(CancellableLongOperation.java:70)
Caused by: com.sap.visibility.bpm.facade.service.ExecutionFault: User is not authorized to query for ProcessDefinition.
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at com.sun.xml.internal.ws.fault.SOAPFaultBuilder.createException(SOAPFaultBuilder.java:135)
at com.sun.xml.internal.ws.client.sei.StubHandler.readResponse(StubHandler.java:238)

Solution :

Assign the UME action SAP_BPM_EXPORT_MODEL to the related user.

From SAP Documentation :

Additional Security Configuration
The SAP HANA user who discovers SAP Business Process Management process definitions needs to have authorization as an SAP Gateway user. To consume process definitions from the SAP Business Process Management system through the SAP Gateway, the user that is authenticated in the SAP Business Process Management system needs to be assigned a UME action SAP_BPM_EXPORT_MODEL. We also recommend the following:
SAP_BPM_EXPORT_MODEL action is assigned to the SAP_BPM_SuperDisplay role.
SAP_BPM_EXPORT_MODEL action is assigned to the Support user.
To enable the user to see business scenario definitions in the space.me workspace, you need to assign the user the respective SQL and analytic privileges. For more information, see .

Saturday, July 9, 2016

The cost of rewrite : Business Object

A business object requiring a change is a headache in a project.

Let's say you need to add a new status field to an existing bean.

Run the SQL scripts and add them to the migration scripts.

Then you have a completely new object. You must change all the constructors.

Check the member methods and change the related ones.

Check the service methods.

Add tests to the unit tests.

...

The list is long. Managers often underestimate this kind of thing:
"I only wanted one new field."
The change may seem simple, but it breaks all the former code.

Friday, July 8, 2016

Junior Idea: "Why don't we write a new program?"

All juniors (including me) come up with the idea of rewriting the program in their first month of employment.
And when they are rejected they are surprised.
They think they offered a better version, yet they were rejected :)

What one understands over time:

1) A complete project running in an old language/style is much better
than an unknown version in a new language/style.

2) Juniors are too quick to propose a rewrite. It is obvious that they have not seen
the whole system and are not aware of the former problems.

3) A junior is a new worker. It is not certain how long he will stay at
that place, so no company can depend on him.

Why do projects fail?


In the programming world you always see unsatisfied customers and developers.
After years of development I can tell from the beginning whether a project will fail,
if I see some signs like the ones below.

1) Low price
Project specs are announced and companies are asked for a price. Then companies bid for the lowest price.
What is the lowest price? Can you afford to build the project with 2 juniors instead of 1 senior?
One, two, three or however many juniors are not one senior.
There is no replacement in programming.

2) Who knows the industry?
Programmers often underrate the project domain. They think the most important thing in a project
is the technical part. I have never seen a big project where domain knowledge was not important.
Programmers must respect domain knowledge and understand it as much as they can.

3) Unclear breakpoints
In the first stages people code fast and reach a certain point.
At that point the customer says this is not what we want.

Will you continue with the problematic logic?
Can you afford a rewrite?

The project plan must have milestones and specs to answer
these questions.



My simple rule for a project:

A project will fail, or will end up low quality,

if there is no one in the team who has done a technically similar project.


Monday, June 27, 2016

SAP Hana Analytical Privilege for Team Logic



You have a context calculation view and you want to implement a manager user who can see the content of his team.

For this purpose you create an authorization table with 2 columns (KEY_NAME, KEY_VALUE).


Table AUTH_OPINT

KEY_NAME  KEY_VALUE
U1        U4,U5
   



According to this table, U1 has a team consisting of U4 and U5.

Your view has a column named approverid, which stores the approver's user ID.
By definition, a manager can see the items he approved himself and the items approved by his team.

We have entries in this table defining our teams, so a query like the one below:
select KEY_VALUE from YOURSCHEMA.AUTH_OPINT where KEY_NAME = SESSION_USER

will return "U4,U5".

instr is a function that searches for its second argument inside the first, so
instr( "U4,U5", "approverid" ) will return a value greater than or equal to 1 if approverid is either U4 or U5.



TOTAL SCRIPT

("approverid" = SESSION_USER ) or
(
instr( (select KEY_VALUE from YOURSCHEMA.AUTH_OPINT where KEY_NAME = SESSION_USER) ,"approverid" ) >= 1
)

SAP Hana Analytical Privilege: Implement a Super User


You have a context calculation view and you want to implement a super user who has access to all content in the cockpit.


For this purpose you create an authorization table with 2 columns (KEY_NAME, KEY_VALUE).


Table AUTH_OPINT

KEY_NAME KEY_VALUE
SUPERUSER U1,U2,U3
   


According to this table, U1, U2 and U3 are super users.
When you define the SQL below as dynamic SQL,

it will return 'SUPERUSER' if the current SESSION_USER is a substring of KEY_VALUE ("U1,U2,U3").



'SUPERUSER' =
(
SELECT KEY_NAME FROM YOURSCHEMA.AUTH_OPINT where KEY_NAME = 'SUPERUSER'
and KEY_VALUE like concat(concat('%', SESSION_USER ),'%')
)

Tuesday, June 14, 2016

SAP OPINT problem Failed to acquire scenario lock; either a background process is processing the scenario or another user is performing operations on the scenario:



When you try to generate a scenario in HANA Studio you get the following error:

Failed to acquire scenario lock; either a background process is processing the scenario or another user is performing operations on the scenario:

You have to execute the SQL below to delete the locks, filling in your own scenario name.
The package name is the project name under Content.
The scenario name is the name of the .hprbusinessscenario file.

If you are not sure about this name, open the table
"SYS_PROCESS_VISIBILITY"."sap.opi.pv::SPVD_SCENARIO_METADATA"
and search for your scenario_name to find the fully qualified name.


delete from "SYS_PROCESS_VISIBILITY" . "sap.opi.pv::SPVR_SCENARIO_DEFINITION_LOCK" where
"SCENARIO_DEFINITION_ID" in (
select "SCENARIO_DEF_ID" from "SYS_PROCESS_VISIBILITY" . "sap.opi.pv::SPVD_SCENARIO_METADATA"
t1 join "SYS_PROCESS_VISIBILITY" . "sap.opi.pv::SPVR_SCENARIO_DEFINITION_LOCK" t2 on
t1.SCENARIO_DEF_ID = t2.SCENARIO_DEFINITION_ID
where t1. "SCENARIO_NAME" = '' group by "SCENARIO_DEF_ID" )

Saturday, June 11, 2016

SAP OPInt , Reporting Logic for Workflow

In classical programming we have a database, which is a row-based representation of the data.
Rows point to each other and the complete logic lives in code.

In BPM programming the flow lives in BPM and the data is either in the database or in the BPM engine.

Classical reports are aggregations of columns. Aggregations are
sum, average, count, ..., which is summary data for mass data.

But our jobs do not consist of aggregates. Jobs consist of tasks, so classical reporting does not show
anything about the smallest piece of our job.

SAP OPInt is a reporting tool for tasks.
It shows you which tasks are critical and
which processes are overdue.

It shows your task flow and can predict the completion time of tasks.
It enables you to take preemptive action on your tasks.

SAP HANA privilege Problems



When you get an authorization error in SAP HANA, you can see the exact error in a trace file.

Open the trace configuration tab as in the picture.
Click new trace.
Enter the user name.
Select "authorization" under Indexserver.

When you are finished, repeat the action that failed.
Go to the diagnosis tab and find your file (it is easiest to sort by date).

Download it and check the end of the trace.
You will see a line like:

User XXX is not authorized to access _SYS_BIC.XY_Business_Scenario/SPVR_YX_SCENARIO_CONTEXT because he is missing any structured privileges applicable to that object

Check your exception and take the appropriate action.

SAP Open SQL Upper Lower Case

Querying HANA Calculation views from Java

Say you want to extract data from a calculation view.
You need to define a JDBC data source (SYS_BICDB) on the NetWeaver administration pages.

( Configuration -> Infrastructure -> Application Resources )

Then you refer to the calculation view as follows:

String sql = "SELECT GROUP FROM \"_SYS_BIC\".\"MY_Business_Scenario/SPVR_MY_SCENARIO_CONTEXT\" where DEF = '" + rfxId + "' ";

MY_Business_Scenario : the name of your scenario
SPVR_MY_SCENARIO_CONTEXT : the name of your calculation view

public String getProcGroup(String rfxId) throws Exception {

  Connection con = null;

  // _SYS_BIC exposes activated views as <package>/<view name>
  String sql = "SELECT GROUP  FROM \"_SYS_BIC\".\"MY_Business_Scenario/SPVR_MY_SCENARIO_CONTEXT\" where DEF =  '" + rfxId + "' ";

  try {
   // Look up the JDBC data source defined in NetWeaver administration
   InitialContext ctx = new InitialContext();
   DataSource dsource = (DataSource) ctx.lookup("jdbc/notx/SYS_BICDB");
   con = dsource.getConnection();

   PreparedStatement q = NativeSQLAccess.prepareNativeStatement(con, sql);
   ResultSet rs = q.executeQuery();

   String procGroup = "";
   while (rs.next()) {
    procGroup = rs.getString(1);
   }

   return procGroup;
  } catch (SQLException e) {
   e.printStackTrace();
   logger.errorT("getProcGroupFail for " + rfxId + " : " + e.getMessage());
  } finally {
   // The connection is always closed here, including on the success path
   if (con != null) {
    con.close();
   }
  }
  return null;
 }