Monthly Archives: December 2016

Logistic Regression by Scikit

Open Boundary

Assume there are 3 lines separating the plane into 3 areas:
line1: y = x  (x <= 10)
line2: y = 20 - x  (x >= 10)
line3: x = 10  (y >= 10)
[Figure: the three lines and the areas A, B, C]

For each area, there are some data points; the third value in each row is the class label:
A:
[-3, -2, 0]
[-3, 1, 0]
[0, 4, 0]
[8, 9, 0]
[8, 10, 0]
[3, 4, 0]

B:
[11, 11, 1]
[15, 6, 1]
[13, 10, 1]
[13, 19, 1]
[23, -1, 1]
[25, -2, 1]

C:
[-2, -3, 2]
[3, 2, 2]
[10, 9, 2]
[15, 10, 2 ]
[20, -1, 2]
[25, -10, 2]

Running the code below, we expect [10, 5] to be classified as 2, because point (10, 5) belongs to area C, which has label 2.

import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model

X = np.array([
    [-3, -2],
    [-3, 1],
    [0, 4],
    [8, 9],
    [8, 10],
    [3, 4],
    [11, 11],
    [15, 6],
    [13, 10],
    [13, 19],
    [23, -1],
    [25, -2],
    [-2, -3],
    [3, 2],
    [10, 9],
    [15, 10],
    [20, -1],
    [25, -10]
])

Y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2])

logreg = linear_model.LogisticRegression(C=1e5)

# Create a logistic regression classifier and fit the data.
logreg.fit(X, Y)

print(logreg.predict(np.array([[10, 5]])))

Running the code below plots the decision boundary and regions:

h = .02  # step size in the mesh

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = logreg.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(1, figsize=(4, 3))
plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)

# Plot also the training points
plt.scatter(X[:, 0], X[:, 1], c=Y, edgecolors='k', cmap=plt.cm.Paired)
plt.xlabel('x1')
plt.ylabel('x2')

plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.xticks(())
plt.yticks(())

plt.show()

[Figure: the predicted regions and decision boundaries, with the training points]

Enclosed Boundary

Suppose we have two areas, A and B:
A: x^2 + y^2 <=2
B: x^2 + y^2 > 2

[Figure: the circle x^2 + y^2 = 2 separating area A from area B]

To solve this, we can use the polynomial logistic regression method: transform the inputs with polynomial features, then fit a logistic regression. This is similar to polynomial linear regression.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn import linear_model

# Area A: x^2 + y^2 <= 2
# Area B: x^2 + y^2 > 2
X = np.array([
    # training set for A
    [0, 0],
    [1, 0],
    [1.5, -0.2],
    [1, 1.5],
    [-1, -1.5],
    [-1.9, 0],
    [-0.5, -1],
    [0, 1.9],
    [0, -1.9],
    [-1.9, 0],
    [-1.5, 0.5],
    # training set for B
    [2, 2.5],
    [2, 3],
    [1, 5],
    [1, -5],
    [-3, 0],
    [-2, -0.1],
    [-2, 1],
    [0, -2.1],
    [0, 2.1],
    [-2.1, 0],
    [-0.6, -2]
])

poly = PolynomialFeatures(degree=2)
X_trans = poly.fit_transform(X)

Y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

logreg = linear_model.LogisticRegression(C=1e6)

# Create a logistic regression classifier and fit the data.
logreg.fit(X_trans, Y)

print(logreg.coef_)
print(logreg.intercept_)
print(logreg.predict(poly.transform([[0, 1.8]])))   # expect to return 0
print(logreg.predict(poly.transform([[0, -1.8]])))  # expect to return 0
print(logreg.predict(poly.transform([[0, -2.1]])))  # expect to return 1

And the result is:
[Figure: the fitted circular decision boundary with the training points]

Accordingly, we can build the hypothesis from the learned coefficients and intercept:
[Figures: the fitted hypothesis and its decision boundary]
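As a minimal sketch (reusing the logreg and poly objects fitted above), the hypothesis can be evaluated by hand by applying the sigmoid to the learned linear combination of the polynomial features:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(x1, x2):
    # h(x) = g(intercept + coef . poly_features(x)): the sigmoid of the
    # linear combination of polynomial features learned by logreg
    features = poly.transform([[x1, x2]])
    z = np.dot(features, logreg.coef_.ravel()) + logreg.intercept_
    return sigmoid(z)

print(hypothesis(0, 1.8))   # < 0.5  -> class 0 (inside the circle)
print(hypothesis(0, -2.1))  # >= 0.5 -> class 1 (outside the circle)

For binary classification this is the same probability that logreg.predict_proba reports for class 1.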

Check my code on github: open boundary, enclosed boundary

 

Logistic Regression

Logistic regression answers YES/NO questions. For example, given the size of a tumor, it answers whether the tumor is malignant. Given the height and weight of a person, it answers whether the person is a man.

Hypothesis

We have hypothesis function h(x), which ranges over 0 <= h(x) <= 1. We define the answer as YES when the hypothesis is at least 0.5, otherwise NO:
predict y = 1 if h(x) >= 0.5, and y = 0 if h(x) < 0.5.

First, let's take a look at the shape of the logistic function (sigmoid function) g(z):
g(z) = 1 / (1 + e^(-z))

It looks like this:
[Figure: the S-shaped sigmoid curve, going from 0 to 1]

We can see that this function satisfies our need: g(z) ranges over (0, 1).
When z >= 0, g(z) >= 0.5, so we predict y = 1;
when z < 0, g(z) < 0.5, so we predict y = 0.
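As a quick numeric check (a minimal sketch, not from the original post):

import numpy as np

def sigmoid(z):
    # Logistic (sigmoid) function g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))    # 0.5, the decision threshold
print(sigmoid(5))    # close to 1 -> predict y = 1
print(sigmoid(-5))   # close to 0 -> predict y = 0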

Having chosen the logistic function, let's define the hypothesis for logistic regression:
h(x) = g(theta^T x) = 1 / (1 + e^(-theta^T x))

Let's do some analysis. When theta = [1, 1, -1], we have h(x) = g(1 + x1 - x2).
For this hypothesis, when 1 + x1 - x2 >= 0 the hypothesis is at least 0.5 and the prediction is YES; otherwise the hypothesis is below 0.5 and the prediction is NO. Looking at the picture below, when the point is from region A the answer is YES; when the point is from region B the answer is NO.

[Figure: the line 1 + x1 - x2 = 0 separating region A from region B]

Another example: with polynomial features (1, x1, x2, x1^2, x2^2) and theta = [-1, 0, 0, 1, 1], we have h(x) = g(-1 + x1^2 + x2^2).
For this hypothesis, when x1^2 + x2^2 >= 1 the hypothesis is at least 0.5 and the prediction is YES; otherwise the hypothesis is below 0.5 and the prediction is NO. Looking at the picture below, when the point is from region A the answer is YES; when the point is from region B the answer is NO.
[Figure: the circle x1^2 + x2^2 = 1 separating region A from region B]
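A small sketch (my own illustration, using the example thetas above) of how these two hypotheses classify a few points:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_linear(x1, x2):
    # h(x) = g(1 + x1 - x2), i.e. theta = [1, 1, -1]
    return 1 if sigmoid(1 + x1 - x2) >= 0.5 else 0

def predict_circle(x1, x2):
    # h(x) = g(-1 + x1^2 + x2^2), i.e. theta = [-1, 0, 0, 1, 1]
    return 1 if sigmoid(-1 + x1 ** 2 + x2 ** 2) >= 0.5 else 0

print(predict_linear(2, 1))    # 1 + 2 - 1 >= 0   -> YES (1)
print(predict_linear(0, 3))    # 1 + 0 - 3 < 0    -> NO (0)
print(predict_circle(2, 2))    # 4 + 4 >= 1       -> YES (1)
print(predict_circle(0.5, 0))  # 0.25 + 0 < 1     -> NO (0)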

Cost function

The cost function for logistic regression is:
Cost(h(x), y) = -log(h(x))       if y = 1
Cost(h(x), y) = -log(1 - h(x))   if y = 0
For the y = 1 case, if the hypothesis is close to 0, the cost will be close to positive infinity; if the hypothesis is close to 1, the cost will be close to 0.
For the y = 0 case, if the hypothesis is close to 0, the cost will be close to 0; if the hypothesis is close to 1, the cost will be close to positive infinity.

Furthermore, the cost function can be rewritten in a single expression as:

Cost(h(x), y) = -y * log(h(x)) - (1 - y) * log(1 - h(x))

And J(theta), the total cost over the m training examples, can be written as:
J(theta) = -(1/m) * sum_{i=1..m} [ y(i) * log(h(x(i))) + (1 - y(i)) * log(1 - h(x(i))) ]

Since we have the cost function, we can solve logistic regression by gradient descent, repeating the update below until theta converges (the change in cost becomes small enough):
repeat {
    theta_j := theta_j - alpha * (1/m) * sum_{i=1..m} (h(x(i)) - y(i)) * x_j(i)   (simultaneously for all j)
}
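A minimal NumPy sketch of this (my own illustration; the example data, alpha, and iteration count are made up):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    # J(theta) = -(1/m) * sum[ y*log(h) + (1-y)*log(1-h) ]
    m = len(y)
    h = sigmoid(X.dot(theta))
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h)) / m

def gradient_descent(X, y, alpha=0.1, iterations=5000):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        h = sigmoid(X.dot(theta))
        theta -= alpha * X.T.dot(h - y) / m   # simultaneous update of all theta_j
    return theta

# Tiny made-up example; the first column is the bias term x0 = 1
X = np.array([[1, 0, 0], [1, 2, 1], [1, 1, 4], [1, 0, 3], [1, 3, 1], [1, 2, 5]], dtype=float)
y = np.array([0, 0, 1, 1, 0, 1], dtype=float)
theta = gradient_descent(X, y)
print(theta)
print(cost(theta, X, y))
print(sigmoid(X.dot(theta)) >= 0.5)   # predicted labels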

Alternatively, logistic regression can be solved by more advanced optimization algorithms (conjugate gradient, BFGS, or L-BFGS).

Learning materials: Andrew Ng's Machine Learning course.

 

Polynomial Linear Regression by Scikit

Suppose we have a 2-degree polynomial function y = 2 + x1 - 3*x2 + 2*x1^2 - 5*x1*x2 + 6*x2^2.

And let's generate some training data. First, let's draw some random (x1, x2) samples:
[array of random (x1, x2) samples]

Then, we transform them with PolynomialFeatures into (1, x1, x2, x1^2, x1*x2, x2^2) form:
[array of transformed samples X_trans]

For this polynomial, the coefficients are [2, 1, -3, 2, -5, 6]. Applying them to the transformed features gives the training targets:
[array of computed target values]

Let's add a little random noise, and we have the final training targets y:
[array of target values with noise added]

So now our training set (X_trans, y) is:
[array of the full training set]

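A minimal sketch of this data-generation step (my own reconstruction; the sample count, ranges, and noise scale are arbitrary, and the coefficient list is assumed to follow the PolynomialFeatures column order):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn import linear_model

np.random.seed(0)

# Random (x1, x2) samples
X = np.random.uniform(-3, 3, size=(50, 2))

# Transform to (1, x1, x2, x1^2, x1*x2, x2^2)
poly = PolynomialFeatures(degree=2)
X_trans = poly.fit_transform(X)

# Apply the chosen coefficients, then add a little noise
coef = np.array([2, 1, -3, 2, -5, 6])
y = X_trans.dot(coef) + np.random.normal(scale=0.5, size=len(X))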
Now, we can fit X_trans and y just as we solve multivariate linear regression:

regr = linear_model.LinearRegression()
regr.fit(X_trans, y)
print(regr.coef_)       # coefficients
print(regr.intercept_)  # intercept

And we get the result; it is very close to the coefficients we set, [2, 1, -3, 2, -5, 6]:
[fitted coefficients and intercept]

Check my code on github.

Multivariate Linear Regression by Scikit

Suppose we have a training set for z = 2 + 3 * x1 + x2 (with a little noise added); run the code below to find the coefficients and intercept.

import numpy as np
from sklearn import linear_model

# z = 2 + 3 * x1 + x2, plus a little noise
X = np.array([
    [3, 0],
    [0, 3],
    [1, 1],
    [2, 3],
    [4, 1]
])
z = np.array([
    [11 + 0.5],
    [5 + 0.7],
    [6 - 0.3],
    [11 + 0.2],
    [15 - 0.8]
])

# Create the linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(X, z)

print(regr.coef_)       # coefficients
print(regr.intercept_)  # intercept

My code on github: LR 2 features

Simple Linear Regression by Scikit

Suppose we have a training set for y = 5 + 0.5 * x. Running the code below, we can find the coefficient (slope) and intercept.

import numpy as np
from sklearn import linear_model

# y = 5 + 0.5 * x
x = [-10.0, -9.5, -9.0, -8.5, -8.0, -7.5, -7.0, -6.5, -6.0, -5.5, -5.0, -4.5, -4.0, -3.5, -3.0, -2.5, -2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 8.5, 9.0, 9.5, 10.0]
y = [-1.7776247, 0.9721851, 2.8311658, -1.7664247, 2.0148739, 1.2848472, 4.9397416, 2.3675736, 3.7809385, 3.9241124, 1.8500219, 4.3446124, 4.7043888, 4.0020450, -1.7093026, 2.4250200, 8.3738445, 7.4809320, 4.7741648, 5.1719583, 6.5781468, 4.5257286, 4.7368501, 6.5144324, 7.0737169, 2.5072875, 5.7161262, 8.8050798, 7.9458498, 9.9034171, 6.6390707, 7.7411975, 6.1857891, 8.1798865, 11.2942560, 11.3548412, 8.6652006, 7.9644377, 10.0735000, 8.5496633, 7.5857769]
a = np.array(x).reshape(len(x), 1)
b = np.array(y).reshape(len(y), 1)

# Create the linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(a, b)

print(regr.coef_)       # slope
print(regr.intercept_)  # intercept

My code on github: LR 1 feature