Sklearn+python: linear regression case

Predict Boston housing prices using first-order linear equations

The data set ships with sklearn and contains the features and prices of 506 houses collected in Boston before 1993. load_boston() is used to load the data.

from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
import time
from sklearn.linear_model import LinearRegression

# Load the Boston housing data set
boston = load_boston()

X = boston.data
y = boston.target

print("X.shape:{}. y.shape:{}".format(X.shape, y.shape))
print('boston.feature_name:{}'.format(boston.feature_names))

# Hold out 20% of the samples as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=3)

model = LinearRegression()

start = time.perf_counter()  # time.clock() was removed in Python 3.8
model.fit(X_train, y_train)

train_score = model.score(X_train, y_train)
cv_score = model.score(X_test, y_test)
print('time used:{0:.6f}; train_score:{1:.6f}, cv_score:{2:.6f}'.format(
    time.perf_counter() - start, train_score, cv_score))

The output content is:

X.shape:(506,13). y.shape:(506,)
boston.feature_name:['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO' 'B' 'LSTAT']
time used:0.012403; train_score:0.723941, cv_score:0.794958

It can be seen that the scores (R²) on both the training set and the test set are not high, which suggests the model is under-fitting.
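Note that for a regressor, score() returns the R² coefficient of determination rather than a classification accuracy. A minimal sketch, reusing the model and test split from above:

from sklearn.metrics import r2_score

# model.score for a regressor is R^2, the same value r2_score gives for its predictions
y_pred = model.predict(X_test)
print(r2_score(y_test, y_pred))   # equals model.score(X_test, y_test)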

Use polynomials for linear regression

The example above under-fits, indicating that the model is too simple for the data. Now increase the model's complexity by introducing polynomial features.

For example, if the original sample has two features [a, b], then with degree 2 the polynomial features become [1, a, b, a^2, ab, b^2]. Other values of degree can be deduced by analogy.

Polynomial features effectively increase the complexity of both the data and the model, enabling a better fit.
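A quick way to see this expansion is to apply PolynomialFeatures to a single two-feature sample. A minimal sketch (the values 2 and 3 are chosen only for illustration):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One sample with two features [a, b] = [2, 3]
X_demo = np.array([[2, 3]])

# degree=2 expands [a, b] into [1, a, b, a^2, ab, b^2]
poly = PolynomialFeatures(degree=2)
print(poly.fit_transform(X_demo))   # [[1. 2. 3. 4. 6. 9.]]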

The following code uses a Pipeline to chain polynomial feature generation and linear regression, and then tests the score when the degree is 1, 2, and 3.

from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
import time
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

def polynomial_model(degree=1):
    # Generate polynomial features of the given degree, then fit a linear regression
    polynomial_features = PolynomialFeatures(degree=degree, include_bias=False)
    # normalize was deprecated in newer sklearn releases; scale features separately there
    linear_regression = LinearRegression(normalize=True)
    pipeline = Pipeline([('polynomial_features', polynomial_features),
                         ('linear_regression', linear_regression)])
    return pipeline

boston = load_boston()
X = boston.data
y = boston.target
print("X.shape:{}. y.shape:{}".format(X.shape, y.shape))
print('boston.feature_name:{}'.format(boston.feature_names))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=3)

for i in range(1, 4):
    print('degree:{}'.format(i))
    model = polynomial_model(degree=i)

    start = time.perf_counter()  # time.clock() was removed in Python 3.8
    model.fit(X_train, y_train)

    train_score = model.score(X_train, y_train)
    cv_score = model.score(X_test, y_test)
    print('time used:{0:.6f}; train_score:{1:.6f}, cv_score:{2:.6f}'.format(
        time.perf_counter() - start, train_score, cv_score))

The output is:

X.shape:(506,13). y.shape:(506,)
boston.feature_name:['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO' 'B' 'LSTAT']
degree:1
time used:0.003576; train_score:0.723941, cv_score:0.794958
degree:2
time used:0.030123; train_score:0.930547, cv_score:0.860465
degree:3
time used:0.137346; train_score:1.000000, cv_score:-104.429619

You can see that degree 1 gives the same result as the plain linear regression above. With degree 3, the score on the training set is 1 while the score on the test set is negative, which is clearly over-fitting.

Therefore, a model with a degree of 2 should be selected in the end.

The second-order polynomial is much better than the first-order one, but there is still a sizable gap between the training and test scores. This may be due to insufficient data; more samples would be needed to further improve the model's accuracy.
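One hedged way to check whether more data would actually help is to plot a learning curve for the degree-2 model. A sketch using sklearn's learning_curve, where the cv and train_sizes values are only illustrative:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

# Score the degree-2 pipeline on growing subsets of the data
train_sizes, train_scores, test_scores = learning_curve(
    polynomial_model(degree=2), X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 5))

plt.plot(train_sizes, train_scores.mean(axis=1), 'o-', label='train score')
plt.plot(train_sizes, test_scores.mean(axis=1), 'o-', label='cv score')
plt.xlabel('training examples')
plt.ylabel('score')
plt.legend()
plt.show()

If the cross-validation curve is still rising as the training size grows, collecting more samples is likely to help; if both curves have already converged, more data alone will not close the gap.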

Comparison of normal equation solution and gradient descent

In addition to using gradient descent to approach the optimal solution iteratively, the solution can also be computed directly with the normal equation.

According to Andrew Ng's (Wu Enda's) course, the optimal solution of linear regression is:

theta = (X^T * X)^-1 * X^T * y
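A minimal numpy sketch of this formula, reusing X_train and y_train from the code above (np.linalg.inv is used here for clarity; in practice np.linalg.pinv or lstsq is numerically safer):

import numpy as np
from sklearn.linear_model import LinearRegression

# Prepend a column of ones so that theta[0] plays the role of the intercept
X_b = np.hstack([np.ones((X_train.shape[0], 1)), X_train])
theta = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y_train   # theta = (X^T X)^-1 X^T y

# Should closely match the coefficients found by sklearn's LinearRegression
lr = LinearRegression().fit(X_train, y_train)
print(theta[0], lr.intercept_)    # intercept
print(theta[1:], lr.coef_)        # feature weights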

In fact, the two methods have their own advantages and disadvantages:

Gradient descent:

Disadvantages: a learning rate must be chosen, and many iterations are required.

Advantages: it still works at a reasonable speed when there are many features (more than 10,000).

Normal equation:

Advantages: no learning rate to tune and no iterations.

Disadvantages: the transpose and inverse of X must be computed, with roughly O(n^3) complexity in the number of features; it becomes very slow when there are many features (more than 10,000).

For non-linear problems such as classification, the normal equation does not apply, so gradient descent has a wider range of applications.
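As an illustration of the gradient-descent side, sklearn also provides SGDRegressor, a linear regressor trained by stochastic gradient descent. A hedged sketch on the same data (SGD needs scaled features, and the hyperparameters shown are just the defaults):

from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# SGD is sensitive to feature scale, so standardize before fitting
sgd_model = Pipeline([
    ('scaler', StandardScaler()),
    ('sgd', SGDRegressor(max_iter=1000, tol=1e-3)),
])
sgd_model.fit(X_train, y_train)
print(sgd_model.score(X_test, y_test))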

That is all for this sklearn + Python linear regression case; I hope it gives you a useful reference.
