# Machine Learning: Linear Regression Example: Concrete

There is a fun archive of machine learning data sets maintained by UC Irvine. For a concrete example, let's take the Concrete Compressive Strength data set and try linear regression on it. (Get it? Concrete? Ha ha ha!) There are 1030 points in the data set, eight input features, and one output feature. Here is the basic info:

| Feature # | Name | Range |
|---|---|---|
| 1 | Amount of Cement (kg/m³) | 102 - 540 |
| 2 | Amount of Blast Furnace Slag (kg/m³) | 0 - 359.4 |
| 3 | Amount of Fly Ash (kg/m³) | 0 - 200.1 |
| 4 | Amount of Water (kg/m³) | 121.75 - 247 |
| 5 | Amount of Superplasticizer (kg/m³) | 0 - 32.2 |
| 6 | Amount of Coarse Aggregate (kg/m³) | 801 - 1145 |
| 7 | Amount of Fine Aggregate (kg/m³) | 594 - 992.6 |
| 8 | Mixture Age (days) | 1 - 365 |
| Output | Compressive Strength (MPa) | 2.3 - 82.6 |

In keeping with the principle that features should be scaled to the range [-1, 1], we subtract the midpoint of each feature's range from that feature and then divide by the new maximum, which is half the width of the original range. For example, the midpoint of feature 1 is (102 + 540)/2 = 321, so subtracting it brings the range to [-219, 219], and dividing by 219 brings it to [-1, 1].
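As a quick check of that arithmetic, here is a minimal Python sketch of the scaling step. It is not the article's Octave code; the function name `scale` is made up for illustration, and the range is copied from the feature 1 row of the table.

```python
def scale(value, lo, hi):
    """Map value from [lo, hi] to [-1, 1] by centering on the midpoint."""
    midpoint = (lo + hi) / 2.0      # 321 for feature 1
    half_width = (hi - lo) / 2.0    # 219 for feature 1
    return (value - midpoint) / half_width


# Feature 1 (cement) ranges from 102 to 540 kg/m^3.
CEMENT_LO, CEMENT_HI = 102.0, 540.0

print(scale(102.0, CEMENT_LO, CEMENT_HI))  # -1.0
print(scale(321.0, CEMENT_LO, CEMENT_HI))  # 0.0
print(scale(540.0, CEMENT_LO, CEMENT_HI))  # 1.0
```

The same function works for every feature; only the `lo` and `hi` values change.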

Here's a plot of feature 1 versus the output. There's a lot of variation, but the two do look roughly correlated.

Here's the Octave code: concrete_regression.m. You'll also need to open Concrete_Data.xls and export it as a CSV file named Concrete_Data.csv so that Octave can read it. Place the .m file and the CSV file in the same directory, change to that directory, run Octave, and then call concrete_regression().

Here is the learning curve and the parameters found. The training cost is in blue, while the cross-validation cost is in red.

\begin{align*} h_\theta(x) &= 0.313 + 0.362x_1 + 0.138x_2 + 0.011x_3 - 0.365x_4\\ &\quad + 0.182x_5 - 0.111x_6 - 0.190x_7 + 0.361x_8 \end{align*}
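To make the hypothesis concrete, here is a small Python sketch that evaluates it for one mix. It is an illustration rather than the article's Octave code: the names `THETA`, `RANGES`, and `predict_scaled` are hypothetical, and it assumes the coefficients apply to features scaled to [-1, 1] as described earlier.

```python
# Learned parameters from the equation above: intercept first, then x1..x8.
THETA = [0.313, 0.362, 0.138, 0.011, -0.365, 0.182, -0.111, -0.190, 0.361]

# (min, max) for features 1..8, copied from the table.
RANGES = [
    (102.0, 540.0), (0.0, 359.4), (0.0, 200.1), (121.75, 247.0),
    (0.0, 32.2), (801.0, 1145.0), (594.0, 992.6), (1.0, 365.0),
]


def predict_scaled(raw_features):
    """Scale each raw feature to [-1, 1], then apply the linear hypothesis."""
    h = THETA[0]
    for theta_j, value, (lo, hi) in zip(THETA[1:], raw_features, RANGES):
        midpoint = (lo + hi) / 2.0
        half_width = (hi - lo) / 2.0
        h += theta_j * (value - midpoint) / half_width
    return h


# A mix with every feature at the midpoint of its range scales to all zeros,
# so the prediction is just the intercept.
midpoint_mix = [(lo + hi) / 2.0 for lo, hi in RANGES]
print(predict_scaled(midpoint_mix))  # 0.313
```

The result is in the same scaled units as the training output, not in MPa.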

The training and cross-validation costs are very close to each other, which is good. It means that the learned parameters are quite representative of the entire data set, not just the training portion, so there is no sign of overfitting.

However, the cost appears to be quite high: about 0.037. If the cost is the usual half mean squared error, the root-mean-square error is √(2 × 0.037) ≈ 0.27, so the output is typically off by about 0.27. That is, by nearly any standard, terrible. Clearly there is some intense underfitting going on, and the only remedy is to get more features and more parameters.
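The jump from a cost of 0.037 to an error of 0.27 can be checked with a few lines of Python. This sketch rests on two assumptions that the article does not state outright: that the cost is J = (1/2m) Σ (h − y)², and that the output was scaled to [-1, 1] the same way as the inputs, which is what lets the error be converted back to MPa.

```python
import math

cost = 0.037  # approximate cost reported above

# Assumption: J = (1 / (2m)) * sum of squared errors, so MSE = 2 * J.
rms_scaled = math.sqrt(2.0 * cost)

# Assumption: the output was scaled by half the width of its 2.3 - 82.6 MPa range.
half_width_mpa = (82.6 - 2.3) / 2.0
rms_mpa = rms_scaled * half_width_mpa

print(round(rms_scaled, 2))  # 0.27
print(round(rms_mpa, 1))     # 10.9
```

Under those assumptions, a typical prediction misses by roughly 11 MPa on an output that spans about 80 MPa.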

But we could combine the features in infinitely many ways. How are we going to find good ways to combine them? In the next article we'll take a look at neural networks, which essentially combine the features for us and give us more parameters as well.
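To get a feel for how quickly hand-made combinations pile up, here is a Python sketch of just one possible choice: adding every pairwise product of the eight features, squares included. It is an illustration only, not something the article's Octave code does; the function name and the sample values are made up.

```python
from itertools import combinations_with_replacement


def degree_two_terms(x):
    """Return all products x_i * x_j with i <= j (squares included)."""
    return [a * b for a, b in combinations_with_replacement(x, 2)]


# Hypothetical scaled feature vector with eight entries in [-1, 1].
x = [0.5, -0.2, 0.0, 0.1, -1.0, 0.3, 0.7, -0.4]

expanded = x + degree_two_terms(x)
print(len(degree_two_terms(x)))  # 36
print(len(expanded))             # 44
```

Even this single, fairly arbitrary choice turns 8 features into 44, and there is no guarantee that these are the combinations that matter.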