Saturday, January 28, 2017

Residual vs Fitted for investment

While checking the graphs you can draw from a regression model, I realized
that one of them is extremely useful for investment. Suppose you have
(actually, we have) thousands of used car prices, and you want to buy a car
at the optimal price as an investment, hoping to sell it later.

Best car to buy = the one maximizing (price expected by the regression - actual advertised price)

Suppose our regression line expects a car's price to be 30K but the actual advertised price is 20K.
There are two possibilities:
1) The car is damaged.
2) The owner needs money urgently and is selling the car at a very low price.

If the advertised price is 40K instead, I cannot find a logical explanation.
Some people try to sell their used cars for more than an unused one costs,
probably because they spent money on amenities they believe are valuable.
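
Stripped of the regression machinery, the selection rule itself is just an argmax over the price gap. Here is a minimal plain-Scala sketch, using made-up expected and asking prices (the names Car, expectedPrice, and askingPrice are mine, just for illustration):

// Hypothetical cars: expectedPrice is what a model would predict,
// askingPrice is what the advertisement says.
case class Car(id: String, expectedPrice: Double, askingPrice: Double)

val cars = Seq(
  Car("A", 30000, 20000),  // much cheaper than expected: damaged, or urgent sale?
  Car("B", 25000, 26000),
  Car("C", 40000, 41000)
)

// Best buy = the car maximizing (expected price - asking price).
val bestBuy = cars.maxBy(c => c.expectedPrice - c.askingPrice)
println(bestBuy)  // Car(A,30000.0,20000.0)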


import org.apache.spark.ml.feature.VectorAssembler

// Columns: mileage in km, model year, advertised price (the label).
val dataset = spark.createDataFrame(
  Seq(
    (20000, 2011, 30000.0),
    (120000, 2014, 20000.0),
    (60000, 2015, 25000.0),
    (20000, 2011, 32000.0),
    (120000, 2014, 21000.0),
    (60000, 2015, 45000.0)
  )
).toDF("km", "year", "label")

val assembler = new VectorAssembler()
  .setInputCols(Array("km", "year"))
  .setOutputCol("features")

val output = assembler.transform(dataset)
output.select("features", "label").show(false)

import org.apache.spark.ml.regression.LinearRegression

val lr = new LinearRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)


val lrModel = lr.fit(output)

display(lrModel, output, "fittedVsResiduals")
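
Note that display() works only inside a Databricks notebook. Outside Databricks, the same fitted-vs-residual pairs can be computed by hand; a minimal sketch (withResiduals is my own name for the result):

import org.apache.spark.sql.functions.col

// Apply the fitted model, then compute residual = observed label - prediction.
val withResiduals = lrModel.transform(output)
  .withColumn("residual", col("label") - col("prediction"))

withResiduals.select("prediction", "residual").show(false)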


println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}")




val trainingSummary = lrModel.summary

println(s"numIterations: ${trainingSummary.totalIterations}")

println(s"objectiveHistory: [${trainingSummary.objectiveHistory.mkString(",")}]")

trainingSummary.residuals.show()

println(s"RMSE: ${trainingSummary.rootMeanSquaredError}")

println(s"r2: ${trainingSummary.r2}")

Coefficients: [-0.19283786035452538,2928.112739446878] Intercept: -5853577.79139608
numIterations: 8
objectiveHistory: [0.5,0.446369757257132,0.352850077757605,0.272318835721877,0.26365142412164966,0.23726105027025182,0.23725993458647637,0.23725993457463043]
+-------------------+
|          residuals|
+-------------------+
|-1000.1704245014116|
|-500.72260739002377|
| -9999.106968107633|
|  999.8295754985884|
| 499.27739260997623|
| 10000.893031892367|
+-------------------+

Residual = Observed value - Predicted value

We must find the cars with the most negative residuals, i.e. the ones priced well below what the model expects; in the table above that is the third row, at about -9999.
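
To rank the advertisements directly rather than reading them off a plot, one option (a sketch reusing the withResiduals frame computed above, assuming that snippet was run) is to sort by residual in ascending order, so the most underpriced cars come first:

withResiduals
  .orderBy(col("residual").asc)  // most negative residual = biggest bargain
  .select("km", "year", "label", "prediction", "residual")
  .show(false)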
The Databricks plot has poor resolution with so few points, so I also wrote an R version.

library(lattice) 
mydata2 = data.frame(
  year = c(2011.0,2012.0,2014.0,2015.0),
  km10000 = c(6.0,7.0,10.0,3.0),
  price1000 = c(200.0,250.0,300.0,400.0)
)



res2.mod1 = lm(price1000 ~ km10000 + year, data = mydata2)
summary(res2.mod1)
fitted(res2.mod1)
xyplot(resid(res2.mod1) ~ fitted(res2.mod1),
       xlab = "Fitted Values",
       ylab = "Residuals",
       main = "Car price based on year and km ",
       par.settings = simpleTheme(col=c("blue","red"),
                                  pch=c(10,3,11), cex=3, lwd=2),
       
       panel = function(x, y, ...)
       {
         panel.grid(h = -1, v = -1)
         panel.abline(h = 0)
         panel.xyplot(x, y, ...)
       }
)       
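
In this plot, points below the horizontal zero line are cars priced under the model's expectation; the further below the line, the better the candidate purchase.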


> fitted(res2.mod1)
       1        2        3        4 
206.1722 240.9232 302.5415 400.3631 
> resid(res2.mod1)
         1          2          3          4 
-6.1721992  9.0767635 -2.5414938 -0.3630705 

The first observation has the most negative residual (about -6.17), so by our rule it is the best candidate to buy.
