LMMpro Lesson 2: How linear regressions work.

Objective: Learn the basic principles of linear regressions.
Focus Group: Langmuir Equation.
Contributed by: C.P. Schulthess (3 March 2008), University of Connecticut, USA.
Requirement: Demo Mode version of LMMpro
Difficulty level: Medium (basic calculus required).
Copyright © 2008 by Alfisol.

Linear regressions are commonly used in all scientific and engineering fields. The objective is to find the best linear fit to a set of data on a graph. Since the equation of a line is y = mx + b, the objective is basically to find the best m and best b values.

The first thing that you should understand about linear regressions is how "best" is defined. Pause and think about this a little bit on your own before you continue reading.

Well, I hope your answer was "the best line is that line that is closest to the data points" or "the best line is that line that has the least error." Minimizing the error is the primary objective of linear regression techniques. But now the question becomes, "How do you minimize the error?" Work through each of the steps below and see if you can find the answer to this question. Just remember, Carl Friedrich Gauss solved it when he was 17 years old in 1795. Granted, he was a genius.

  1. Given a line y = mx + b that predicts the data, for every data point (xi, yi) what is the error of the prediction made by the line?
    Let ε = the sum of all the errors; let εi = the error of the prediction for the ith data point; let yi = the y value of the ith data point; and let yp = the y value predicted by the line when x = xi.

    (A) εi = yi - yp.
    (B) |εi| = |yi - yp|.
    (C) εi² = (yi - yp)².
    (D) εi⁴ = (yi - yp)⁴.

    Any of your choices above will be correct, but only one will turn out to be much more practical. Choice (A) is certainly correct. The sum of all the errors for all of the data points will be equal to zero if the line chosen is indeed the best line through the data. This was the method used prior to 1795, which is when Gauss derived the method of linear regression, or perhaps prior to 1805, which is when Legendre first published the method. We will not use choice (A), however, because it is not easy to minimize. Positive and negative errors cancel one another, so many different lines can produce a total error of zero, and a plot of the error as a function of the various values for m and b runs from -∞ to +∞; locating where it crosses zero can only be done by trial-and-error (no pun intended).

    Choice (B) is also correct. This time, a plot of the error as a function of the various values for m and b will have a true minimum, although that minimum equals zero only if every data point falls exactly on the line. Notice that you have to manually evaluate the absolute value of the error for each data point, and the function is not smooth at its minimum, so locating where the minimum occurs can only be done by trial-and-error (once again, no pun intended). Consequently, this is not a practical answer.

    Choice (C) is also correct. The sum of all the (signed) errors for all of the data points will still be equal to zero if the line chosen is indeed the best line through the data. Note, however, that the sum of all the errors squared will NOT be equal to zero. Instead, it will be equal to a minimum value. Choice (C) turns out to be the most practical answer simply because we can locate that minimum without any trial-and-error guesses and without any concern about the error being positive or negative. Notice that the square of the error is always positive regardless of what the sign was prior to squaring it. The criterion for a linear regression is to minimize the sum of the squared error terms.

    Choice (D) is also correct, but it is less practical than choice (C). Calculating the fourth power of a number is more troublesome than calculating its square, and, more importantly, setting the derivatives of a fourth-power error sum equal to zero yields cubic equations in m and b rather than linear ones, so there is no simple closed-form solution. Why do more work than you have to?
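    As a quick numerical illustration (the values here are invented): suppose one data point has yi = 5 while the line predicts yp = 3, and a second data point has yi = 3 while the line predicts yp = 5. Then:

    \[
    \varepsilon_1 = 2, \qquad \varepsilon_2 = -2, \qquad
    \varepsilon_1 + \varepsilon_2 = 0, \qquad
    |\varepsilon_1| + |\varepsilon_2| = 4, \qquad
    \varepsilon_1^2 + \varepsilon_2^2 = 8
    \]

    Under choice (A) the two errors cancel to zero even though the line misses both points, while under choices (B), (C), and (D) the misses accumulate. This is why summing the signed errors cannot, by itself, identify a unique best line.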

  2. Now for the hard part: finding the minimum. Actually, we will do that in step 3. For now, let's make sure we know what the error function is. That is, let's write the equation that we seek to minimize.

    Substitute the equation of the line for yp above. Namely, let yp = mxi + b. Let ε² be the sum of all the squared errors from i = 1 to i = n. (A worked version is shown below.)
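    For checking your work, the function to be minimized, written out in full, is:

    \[
    \varepsilon^2 = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} \left( y_i - m x_i - b \right)^2
    \]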

  3. Calculus is an excellent tool here. From calculus we know that the first derivative of a function gives us another function whose value at any x equals the slope of the tangent to the original function at x. We seek to find the minimum value of the error function expressed in step 2 above. If the value of m and the value of b are perfect, then the value of ε² will be at a minimum and its tangent at that point (that is, its first derivative at that point) will be equal to zero. This is because the slope of the curve at that point will be perfectly horizontal.

    Accordingly, we seek to find the first derivative of the error function expressed in step 2 above with respect to m and with respect to b. Namely, complete the following two first derivatives:

    d(ε²)/dm = . . .

    d(ε²)/db = . . .

    Note, with d(ε²)/dm you seek to find the equation that gives the slope of the error-square function as a function of the value of m chosen. You seek the value of m where this slope is equal to zero; that is, where it is horizontal. Similarly, with d(ε²)/db you seek to find the equation that gives the slope of the error-square function as a function of the value of b chosen. You seek the value of b where this slope is equal to zero; that is, where it is horizontal.

    Also note that it does not matter that you do not yet know the correct values of m and b. There is no guessing involved. You only need to know that the first derivative will be zero when these values are correct.
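    For checking your work, applying the chain rule to the sum written in step 2 gives:

    \[
    \frac{d(\varepsilon^2)}{dm} = -2 \sum_{i=1}^{n} x_i \left( y_i - m x_i - b \right)
    \]

    \[
    \frac{d(\varepsilon^2)}{db} = -2 \sum_{i=1}^{n} \left( y_i - m x_i - b \right)
    \]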

  4. Now for the tricky algebra. Set your equation of d(ε²)/dm equal to zero. Similarly, set your equation of d(ε²)/db equal to zero. You now have two equations and two unknowns. Solve for m and b. Your answer will be two equations, one that solves for m and the other that solves for b, and these will be the best m and b values for the line that fits the data with the least amount of error.

    Believe it or not, in math we love to keep things simple. Accordingly, it is sometimes customary to simplify the notation as follows:

    Let Sx = Σ xi
    Let Sy = Σ yi
    Let Sxy = Σ (xi yi)
    Let Sxx = Σ (xi²)
    Note also that Σ b = nb, because the constant b is added once for each of the n data points.

    When you complete this step you will have the best values of m and b for the line y = mx + b that predicts the data. You know the data values because they are your data. You also know the number of points involved (n), as well as the values for Sx, Sy, Sxy, and Sxx. With all of these known values, you use your final two equations to get the best values of m and b. No guessing was involved, no trial-and-error was involved, and that was the whole point of doing it this way.
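    For checking your work: in this notation, setting d(ε²)/dm = 0 and d(ε²)/db = 0 produces the pair of linear equations

    \[
    m S_{xx} + b S_x = S_{xy}, \qquad m S_x + n b = S_y
    \]

    and solving them simultaneously gives

    \[
    m = \frac{n S_{xy} - S_x S_y}{n S_{xx} - S_x^2}, \qquad b = \frac{S_y - m S_x}{n}
    \]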


  5. The questions below illustrate how to use the linear regression method just derived above.

  6. The Langmuir equation is used for adsorption isotherms. The data are the amount of ions adsorbed (Γ) as a function of the concentration of ions in solution at equilibrium (c). The equation, proposed by Langmuir in 1916, is as follows:
    Γ = Γmax K c / (1 + K c)
    where the values K and Γmax are constants. The K and Γmax constants are what we wish to determine so that we get the least error between the predicted values (using the equation above) and the actual data collected. Notice, however, that the Langmuir equation describes a hyperbolic curve. It is not a straight line. A straight line has the form y = mx + b. What now?

    The practical answer is to transform the Langmuir equation into a linear form. If we accomplish this, then we can use all our knowledge of how to get the best m and b values. The best m and b values would, in turn, tell us what the best K and Γmax values happen to be.

    (A) Show how to manipulate the Langmuir equation into the following linear equation:

    1/Γ = 1/Γmax + 1/(Γmax K c)
    which is known as the Lineweaver-Burk equation.

    (B) In y = mx + b, what are y, m, x, and b in the Lineweaver-Burk linear equation shown above?

    (C) If you know the best m and b values for the Lineweaver-Burk equation, what are the best K and Γmax values for the Langmuir equation?
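    For checking your work, one route through parts (A), (B), and (C): take the reciprocal of both sides of the Langmuir equation and split the resulting fraction:

    \[
    \frac{1}{\Gamma} = \frac{1 + K c}{\Gamma_{max} K c} = \frac{1}{\Gamma_{max}} + \frac{1}{\Gamma_{max} K} \cdot \frac{1}{c}
    \]

    so y = 1/Γ, x = 1/c, m = 1/(Γmax K), and b = 1/Γmax. Inverting these identifications gives Γmax = 1/b and K = b/m.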

  7. (A) Show how to manipulate the Langmuir equation into the following linear equation:
    Γ = Γmax - Γ / (K c)
    which is known as the Eadie-Hofstee equation.

    (B) In y = mx + b, what are y, m, x, and b in the Eadie-Hofstee linear equation shown above?

    (C) If you know the best m and b values for the Eadie-Hofstee equation, what are the best K and Γmax values for the Langmuir equation?
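    For checking your work: multiply both sides of the Langmuir equation by (1 + K c), then divide through by K c:

    \[
    \Gamma (1 + K c) = \Gamma_{max} K c
    \quad \Longrightarrow \quad
    \frac{\Gamma}{K c} + \Gamma = \Gamma_{max}
    \quad \Longrightarrow \quad
    \Gamma = \Gamma_{max} - \frac{1}{K} \cdot \frac{\Gamma}{c}
    \]

    so y = Γ, x = Γ/c, m = -1/K, and b = Γmax. Therefore K = -1/m and Γmax = b.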

  8. (A) Show how to manipulate the Langmuir equation into the following linear equation:
    Γ / c = K Γmax - K Γ
    which is known as the Scatchard equation.

    (B) In y = mx + b, what are y, m, x, and b in the Scatchard linear equation shown above?

    (C) If you know the best m and b values for the Scatchard equation, what are the best K and Γmax values for the Langmuir equation?
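    For checking your work: starting from Γ + Γ K c = Γmax K c (the same intermediate step used for the Eadie-Hofstee form), divide through by c instead of by K c:

    \[
    \frac{\Gamma}{c} + K \Gamma = K \Gamma_{max}
    \quad \Longrightarrow \quad
    \frac{\Gamma}{c} = K \Gamma_{max} - K \Gamma
    \]

    so y = Γ/c, x = Γ, m = -K, and b = K Γmax. Therefore K = -m and Γmax = b/K = -b/m.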

  9. (A) Show how to manipulate the Langmuir equation into the following linear equation:
    c / Γ = c / Γmax + 1 / (K Γmax)
    which is known as the Langmuir linear equation. The names are similar because Langmuir also proposed this linear form of his equation in 1918.

    (B) In y = mx + b, what are y, m, x, and b in the Langmuir linear equation shown above?

    (C) If you know the best m and b values for the Langmuir linear equation, what are the best K and Γmax values for the original Langmuir equation?
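    For checking your work: take the reciprocal of both sides of the Langmuir equation, as in the Lineweaver-Burk case, but then multiply through by c:

    \[
    \frac{c}{\Gamma} = \frac{c (1 + K c)}{\Gamma_{max} K c} = \frac{1}{K \Gamma_{max}} + \frac{c}{\Gamma_{max}}
    \]

    so y = c/Γ, x = c, m = 1/Γmax, and b = 1/(K Γmax). Therefore Γmax = 1/m and K = m/b.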

  10. You can review what these four linear regressions of the Langmuir equation look like using the LMMpro Langmuir software distributed by Alfisol (www.alfisol.com). Review the embedded demo data sets 1, 2 and 3. Study what the m and b values are with each of the linear regressions shown and with each of the data sets available.
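    If you would like to experiment with these formulas outside of LMMpro, the following short program is a minimal sketch in plain Python (it is not part of LMMpro, and the constants K = 2.0, Γmax = 5.0, and the concentration values are invented for illustration). It generates noise-free synthetic Langmuir data, applies the Lineweaver-Burk transformation from question 6, runs the least-squares formulas derived in steps 1 through 4, and recovers K and Γmax:

    # Minimal illustration (not part of LMMpro): fit synthetic Langmuir data
    # using the least-squares formulas derived in steps 1-4 and the
    # Lineweaver-Burk linearization from question 6.

    def linear_regression(x, y):
        """Return the least-squares slope m and intercept b for y = m*x + b."""
        n = len(x)
        Sx = sum(x)
        Sy = sum(y)
        Sxy = sum(xi * yi for xi, yi in zip(x, y))
        Sxx = sum(xi ** 2 for xi in x)
        m = (n * Sxy - Sx * Sy) / (n * Sxx - Sx ** 2)
        b = (Sy - m * Sx) / n
        return m, b

    # Invented demo constants: generate perfect (noise-free) Langmuir data.
    K_true, Gmax_true = 2.0, 5.0
    c = [0.1, 0.25, 0.5, 1.0, 2.0, 4.0]  # equilibrium concentrations
    G = [Gmax_true * K_true * ci / (1 + K_true * ci) for ci in c]  # adsorbed amounts

    # Lineweaver-Burk transformation: y = 1/G as a function of x = 1/c.
    m, b = linear_regression([1 / ci for ci in c], [1 / Gi for Gi in G])

    # Recover the Langmuir constants: Gmax = 1/b, K = b/m.
    print("Gmax =", 1 / b)  # prints 5.0 (up to rounding)
    print("K    =", b / m)  # prints 2.0 (up to rounding)

    Because this synthetic data set contains no scatter, all four linearizations would recover the same K and Γmax; with real, noisy data the four linear forms generally give different answers, and comparing them is exactly what the demo data sets in LMMpro let you explore.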