One graph is extremely useful for investment: the fitted-versus-residuals plot.
Suppose you have (in fact, we have) thousands of used car prices, and you want to buy
the car with the optimal price as an investment (hoping to sell it later).
Best car to buy = max(expected price from the regression - real price)
Suppose our regression line predicts a car's price to be 30K, but the actual advertised price is 20K.
There are two possibilities:
1) The car is damaged.
2) The owner needs money urgently and is selling the car at a very low price.
If the advertised price is instead 40K, I cannot find a logical explanation for it. Some people
try to sell their used cars for more than an unused one costs; probably they spent money on
amenities they consider very valuable.
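The idea above can be sketched in a few lines of Python. This is a hypothetical toy (the listing names and prices are made up, and the "expected" prices stand in for the regression output), not the Spark pipeline below: given each car's advertised price and the price the model expects, the best candidate maximizes the gap.

```python
# Hypothetical listings: name -> (advertised price, price expected by the regression).
listings = {
    "car_a": (20_000.0, 30_000.0),  # advertised far below expectation: candidate bargain
    "car_b": (32_000.0, 30_000.0),  # advertised above expectation
    "car_c": (25_000.0, 26_000.0),  # roughly fairly priced
}

def bargain_score(advertised: float, expected: float) -> float:
    """Best car to buy = max(expected price - real price)."""
    return expected - advertised

# Pick the listing with the largest expected-minus-advertised gap.
best = max(listings, key=lambda name: bargain_score(*listings[name]))
print(best)  # car_a: expected 30K but advertised at 20K
```

Whether that gap signals a bargain or hidden damage is exactly the judgment call discussed above.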
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors

val dataset = spark.createDataFrame(Seq(
  (20000, 2011, 30000.0),
  (120000, 2014, 20000.0),
  (60000, 2015, 25000.0),
  (20000, 2011, 32000.0),
  (120000, 2014, 21000.0),
  (60000, 2015, 45000.0)
)).toDF("km", "year", "label")

val assembler = new VectorAssembler()
  .setInputCols(Array("km", "year"))
  .setOutputCol("features")

val output = assembler.transform(dataset)
output.select("features", "label").show(false)
import org.apache.spark.ml.regression.LinearRegression

val lr = new LinearRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)

val lrModel = lr.fit(output)
display(lrModel, output, "fittedVsResiduals")

println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}")

val trainingSummary = lrModel.summary
println(s"numIterations: ${trainingSummary.totalIterations}")
println(s"objectiveHistory: [${trainingSummary.objectiveHistory.mkString(",")}]")
trainingSummary.residuals.show()
println(s"RMSE: ${trainingSummary.rootMeanSquaredError}")
println(s"r2: ${trainingSummary.r2}")
Coefficients: [-0.19283786035452538,2928.112739446878] Intercept: -5853577.79139608
numIterations: 8
objectiveHistory: [0.5,0.446369757257132,0.352850077757605,0.272318835721877,0.26365142412164966,0.23726105027025182,0.23725993458647637,0.23725993457463043]
+-------------------+
| residuals|
+-------------------+
|-1000.1704245014116|
|-500.72260739002377|
| -9999.106968107633|
| 999.8295754985884|
| 499.27739260997623|
| 10000.893031892367|
+-------------------+
Residual = observed value - predicted value
We must find the cars with the most negative residuals (much cheaper than expected).
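The same selection can be sketched outside Spark with plain NumPy on the toy data above. This fits ordinary least squares with no elastic-net regularization, so the coefficients differ slightly from Spark's, but the residuals come out essentially the same and the bargain is still the row with the most negative residual.

```python
import numpy as np

# Same toy data as the Spark example: (km, year) -> price.
X = np.array([
    [20000, 2011],
    [120000, 2014],
    [60000, 2015],
    [20000, 2011],
    [120000, 2014],
    [60000, 2015],
], dtype=float)
y = np.array([30000.0, 20000.0, 25000.0, 32000.0, 21000.0, 45000.0])

# Ordinary least squares with an explicit intercept column.
A = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

residuals = y - A @ coef          # residual = observed - predicted
bargain_index = int(np.argmin(residuals))  # most negative residual
print(bargain_index, residuals[bargain_index])  # row 2: about -10000
```

Row 2 (the 25K car that the model expects to cost about 35K) is the counterpart of the -9999 residual in the Spark table above.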
The Databricks plot has poor resolution with so few points, so I also wrote an R version.
library(lattice)

mydata2 = data.frame(
  year = c(2011.0, 2012.0, 2014.0, 2015.0),
  km10000 = c(6.0, 7.0, 10.0, 3.0),
  price1000 = c(200.0, 250.0, 300.0, 400.0)
)

res2.mod1 = lm(price1000 ~ km10000 + year, data = mydata2)
summary(res2.mod1)
fitted(res2.mod1)

xyplot(resid(res2.mod1) ~ fitted(res2.mod1),
  xlab = "Fitted Values",
  ylab = "Residuals",
  main = "Car price based on year and km",
  par.settings = simpleTheme(col = c("blue", "red"),
    pch = c(10, 3, 11), cex = 3, lwd = 2),
  panel = function(x, y, ...) {
    panel.grid(h = -1, v = -1)
    panel.abline(h = 0)
    panel.xyplot(x, y, ...)
  }
)
> fitted(res2.mod1)
1 2 3 4
206.1722 240.9232 302.5415 400.3631
> resid(res2.mod1)
1 2 3 4
-6.1721992 9.0767635 -2.5414938 -0.3630705