One graph is extremely useful for investment.
Suppose you have thousands of used-car prices (in fact, we do) and you want to buy a car
at an optimum price as an investment, hoping to sell it later.
Best car to buy = max(Expected price by regression - Actual price)
Suppose our regression line expects a car's price to be 30K, but the actual advertised price is 20K.
There are two possibilities:
1) The car is damaged.
2) The owner needs money urgently and is selling the car at a very low price.
If the advertised price is 40K, I cannot find a logical explanation for it. Some people
try to sell their used cars for more than an unused one, probably
because they spent money on amenities they consider very valuable.
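The idea above can be sketched in plain Scala. This is a minimal illustration, not the Spark pipeline below: the `Ad` class, the `predict` function, and its coefficients are hypothetical stand-ins for a fitted regression model.

```scala
// Minimal sketch: rank ads by (predicted - asking) price gap.
// The coefficients below are hypothetical, standing in for a fitted model.
case class Ad(id: String, km: Double, year: Double, asking: Double)

// Hypothetical fitted model: price = intercept + wKm * km + wYear * year
def predict(km: Double, year: Double): Double =
  -5000000.0 + (-0.2) * km + 2500.0 * year

val ads = Seq(
  Ad("a", 20000, 2011, 30000),
  Ad("b", 120000, 2014, 20000),
  Ad("c", 60000, 2015, 25000)
)

// Largest gap = priced furthest below what the model expects.
val best = ads.maxBy(ad => predict(ad.km, ad.year) - ad.asking)
println(best.id)
```

The car maximizing the gap is the candidate bargain; whether it really is one (damage vs. urgent sale) still requires inspection.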
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors

val dataset = spark.createDataFrame(Seq(
  (20000, 2011, 30000.0),
  (120000, 2014, 20000.0),
  (60000, 2015, 25000.0),
  (20000, 2011, 32000.0),
  (120000, 2014, 21000.0),
  (60000, 2015, 45000.0)
)).toDF("km", "year", "label")

val assembler = new VectorAssembler()
  .setInputCols(Array("km", "year"))
  .setOutputCol("features")

val output = assembler.transform(dataset)
output.select("features", "label").show(false)

import org.apache.spark.ml.regression.LinearRegression

val lr = new LinearRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)

val lrModel = lr.fit(output)
display(lrModel, output, "fittedVsResiduals")
println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}")

val trainingSummary = lrModel.summary
println(s"numIterations: ${trainingSummary.totalIterations}")
println(s"objectiveHistory: [${trainingSummary.objectiveHistory.mkString(",")}]")
trainingSummary.residuals.show()
println(s"RMSE: ${trainingSummary.rootMeanSquaredError}")
println(s"r2: ${trainingSummary.r2}")

Output:

Coefficients: [-0.19283786035452538,2928.112739446878] Intercept: -5853577.79139608
numIterations: 8
objectiveHistory: [0.5,0.446369757257132,0.352850077757605,0.272318835721877,0.26365142412164966,0.23726105027025182,0.23725993458647637,0.23725993457463043]

+-------------------+
|          residuals|
+-------------------+
|-1000.1704245014116|
|-500.72260739002377|
| -9999.106968107633|
|  999.8295754985884|
| 499.27739260997623|
| 10000.893031892367|
+-------------------+
Residual = Observed value - Predicted value
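The printed coefficients can be checked by hand. For the first row of the dataset (km = 20000, year = 2011, observed price = 30000), plugging into predicted = intercept + coefKm * km + coefYear * year reproduces the first residual in the table:

```scala
// Recompute the first residual from the coefficients Spark printed above.
val coefKm = -0.19283786035452538
val coefYear = 2928.112739446878
val intercept = -5853577.79139608

val predicted = intercept + coefKm * 20000 + coefYear * 2011
val residual = 30000.0 - predicted  // observed - predicted

println(residual)  // about -1000.17, matching the residuals table
```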
We must find the cars with the most negative residuals (much cheaper than expected).
The Databricks graph renders poorly with so few points, so I also wrote an R version.
library(lattice)

mydata2 = data.frame(
  year = c(2011.0, 2012.0, 2014.0, 2015.0),
  km10000 = c(6.0, 7.0, 10.0, 3.0),
  price1000 = c(200.0, 250.0, 300.0, 400.0)
)

res2.mod1 = lm(price1000 ~ km10000 + year, data = mydata2)
summary(res2.mod1)
fitted(res2.mod1)

xyplot(resid(res2.mod1) ~ fitted(res2.mod1),
  xlab = "Fitted Values",
  ylab = "Residuals",
  main = "Car price based on year and km",
  par.settings = simpleTheme(col = c("blue", "red"), pch = c(10, 3, 11), cex = 3, lwd = 2),
  panel = function(x, y, ...) {
    panel.grid(h = -1, v = -1)
    panel.abline(h = 0)
    panel.xyplot(x, y, ...)
  })
> fitted(res2.mod1)
        1        2        3        4
 206.1722 240.9232 302.5415 400.3631
> resid(res2.mod1)
         1          2          3          4
-6.1721992  9.0767635 -2.5414938 -0.3630705