Flowers dataset

Today we will look at an image dataset. The data consists of photos of flowers. There are 17 types of flowers and the task is to recognize the flower from the image. We will look at just 10 of the classes in the notes today. The dataset was constructed by the Visual Geometry Group (VGG) at Oxford University; if you are curious, the original paper describing the data is worth a look.

Image data in R

We need to read image data into R in two steps. First, we load the metadata. This is exactly the same as any other dataset we have worked with:

flowers <- read_csv("../data/flowers_17.csv.bz2")
flowers
## # A tibble: 1,360 × 4
##    obs_id    train_id class class_name
##    <chr>     <chr>    <dbl> <chr>     
##  1 id_000001 valid        4 crocus    
##  2 id_000002 train        6 tigerlily 
##  3 id_000003 valid        9 sunflower 
##  4 id_000004 train       15 windflower
##  5 id_000005 train        3 bluebell  
##  6 id_000006 train       15 windflower
##  7 id_000007 train       15 windflower
##  8 id_000008 train       11 colts foot
##  9 id_000009 valid       10 daisy     
## 10 id_000010 train        1 snowdrop  
## # … with 1,350 more rows
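
Before reading the images, it can help to see how the photos are split between the training and validation sets. Here is a quick sketch, assuming the tidyverse is loaded as in previous notes:

# count how many images fall into the training and validation sets
flowers %>%
  count(train_id)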

Reading in the images themselves is similar to having an annotation table in addition to the document table when working with text. However, unlike text, the “raw” data is not stored directly in the metadata table; usually, the images will be individual image files on your computer. We can read a single image into R using the functions readPNG or readJPEG. For example, here is one of the flowers:

library(png)
img_data <- readPNG("../data/flower.png")

What is this object? It is something similar to a matrix, called an array:

class(img_data)
## [1] "array"

However, unlike a matrix, which has only two dimensions (the number of rows and the number of columns), an array can have an arbitrary number of dimensions. We can show the image in R using the following:

par(mar = c(0,0,0,0))
plot(0, 0, xlim=c(0,1), ylim=c(0,1), axes= FALSE, type = "n")
rasterImage(img_data, 0, 0, 1, 1)

The array object in R has three dimensions:

dim(img_data)
## [1] 64 64  3

These correspond to an image that is 64 pixels by 64 pixels and has three color channels (red, green, and blue). Each entry in the array is a number between 0 and 1 that tells us how much of the corresponding color is present at that pixel. For example, here are the first five rows and columns of the image:

img_data[1:5, 1:5,]
## , , 1
## 
##            [,1]       [,2]       [,3]       [,4]       [,5]
## [1,] 0.18431373 0.41176471 0.05098039 0.04705882 0.03921569
## [2,] 0.09803922 0.58039216 0.05490196 0.03529412 0.04313725
## [3,] 0.06274510 0.44705882 0.20000000 0.05098039 0.07450980
## [4,] 0.10980392 0.01960784 0.60784314 0.16078431 0.12156863
## [5,] 0.17254902 0.17254902 0.49803922 0.12941176 0.24705882
## 
## , , 2
## 
##            [,1]       [,2]       [,3]       [,4]       [,5]
## [1,] 0.20784314 0.43921569 0.04313725 0.04705882 0.03921569
## [2,] 0.09803922 0.62352941 0.07450980 0.05098039 0.04705882
## [3,] 0.05882353 0.47843137 0.20784314 0.05098039 0.07450980
## [4,] 0.12941176 0.03529412 0.62745098 0.15686275 0.12549020
## [5,] 0.18823529 0.19215686 0.53725490 0.15294118 0.25098039
## 
## , , 3
## 
##            [,1]       [,2]       [,3]       [,4]       [,5]
## [1,] 0.19607843 0.36078431 0.05098039 0.01960784 0.03529412
## [2,] 0.08235294 0.54901961 0.02745098 0.01568627 0.02352941
## [3,] 0.04705882 0.44313725 0.16862745 0.05882353 0.07450980
## [4,] 0.12941176 0.02352941 0.56862745 0.14117647 0.10196078
## [5,] 0.19215686 0.17647059 0.50196078 0.09803922 0.22745098
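
As a quick check of what these numbers mean, we can take a single pixel and turn its three channel values into a hex color code with the base R function rgb (just a sketch; any pixel would work):

# convert the red, green, and blue values of the top-left pixel into a hex code
rgb(img_data[1, 1, 1], img_data[1, 1, 2], img_data[1, 1, 3])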

To get all of the data, we could write a loop and load each image into R. Another option, common when working with a dataset built for training machine learning algorithms, is to read the entire collection from a single large binary file. Here are all of the images in the dataset stored as one array:

x64 <- read_rds("../data/flowers_17_x64.rds")

What is this object? It’s another array:

class(x64)
## [1] "array"

However, this array has four dimensions instead of three:

dim(x64)
## [1] 1360   64   64    3

These dimensions correspond to the number of images (1360), the height of each image (64 pixels), the width of each image (64 pixels), and the three color channels (red, green, and blue).
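
For example, we can pull out a single image from the large array and confirm that it has the same shape as the image we read in directly:

# the first image in the collection; dropping the first index leaves a 64 x 64 x 3 array
dim(x64[1,,,])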

We will explore this data more through today’s notes.

Data subset

I only want to look at the first 10 types of flowers in today’s notes.

x64 <- x64[flowers$class %in% 0:9,,,]
flowers <- flowers[flowers$class %in% 0:9,]
fnames <- flowers$class_name[match(0:9, flowers$class)]
fnames <- factor(fnames, levels = fnames)
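
As a quick check, the original data contains 80 photos of each of the 17 flowers, so keeping 10 classes should leave 800 images:

# both objects should now describe the same 800 images
dim(x64)
nrow(flowers)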

Class “1” consists of the snowdrop flower. Let’s look at a few of the images to get a sense of what this flower looks like:

par(mar = c(0,0,0,0))
par(mfrow = c(3, 4))
set.seed(1)
for (i in sample(which(flowers$class == 1), 12)) {
  plot(0,0,xlim=c(0,1),ylim=c(0,1),axes= FALSE,type = "n")
  rasterImage(x64[i,,,],0,0,1,1)
}

We can look at these and instantly see the similarities. The difficulty is going to be teaching the computer to pick up on these similarities as well. Let’s now look at a representative flower from each of the 10 classes:

par(mar = c(0,0,0,0))
par(mfrow = c(3, 4))
set.seed(1)
for (i in 0:9) {
  plot(0,0,xlim=c(0,1),ylim=c(0,1),axes= FALSE,type = "n")
  j <- sample(which(flowers$class == i), 1)
  rasterImage(x64[j,,,],0,0,1,1)
  text(0.5, 0.1, flowers$class_name[j], cex = 3, col = "salmon")
}

Notice that color will be useful for telling some of these apart, but not sufficient for distinguishing all classes. Crocuses and irises, for example, have very similar colors.
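
As a rough check of this intuition, one small sketch is to compare the average intensity of each color channel across two classes (crocus is class 4 and iris is class 5 in this subset):

# average red, green, and blue intensity over all crocus images and all iris images
apply(x64[flowers$class == 4,,,], 4, mean)
apply(x64[flowers$class == 5,,,], 4, mean)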

Collapse into data matrix

To start, we use a trick to flatten the dataset from an array into a matrix. This just unravels each image into a single long row of pixel intensities.

X <- t(apply(x64, 1, cbind))
y <- flowers$class

X_train <- X[flowers$train_id == "train",]
y_train <- y[flowers$train_id == "train"]
dim(X_train)
## [1]   600 12288

Why does the matrix X have so many columns? Each image has 64 * 64 * 3 values, which yields 12288 total variables. Notice how quickly the number of components in an image grows. If we had an HD image, this would require about 2.7 million values!

\[ 1280 \times 720 \times 3 = 2{,}764{,}800 \]

Let’s apply the elastic net to our data (it is really the only technique we know that can handle so many variables):

library(glmnet)
model <- cv.glmnet(X_train, y_train, family = "multinomial", nfolds = 3)
beta <- coef(model, s = model$lambda.1se)
beta <- Reduce(cbind, beta)

dim(beta[apply(beta != 0, 1, any),])
## [1] 473  10

The resulting model, even at lambda.1se, has nearly 500 non-zero components. Evaluating the model, we see that it overfits to the training data:

pred <- predict(model, newx = X, type = "class")
tapply(pred == y, flowers$train_id, mean)
## train valid 
## 0.905 0.420
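
For context, the classes are balanced, so always guessing a single class would only be right about 10% of the time. We can check this baseline directly (a quick sketch):

# accuracy from always predicting the most common class in the validation set
valid_classes <- y[flowers$train_id == "valid"]
max(table(valid_classes)) / length(valid_classes)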

Given that baseline, a 42% classification rate across ten classes is actually quite good. I think we can probably do slightly better though! A confusion matrix is also instructive:

table(pred[flowers$train_id == "valid"], y[flowers$train_id == "valid"])
##    
##      0  1  2  3  4  5  6  7  8  9
##   0 12  0  1  0  0  1  0  2  1  0
##   1  2  5  1  3  2  1  0  1  1  0
##   2  0  2  9  9  0  0  4  4  0  0
##   3  0  5  3  2  3  2  1  3  1  0
##   4  1  4  0  2 12  5  2  1  5  0
##   5  1  2  0  1  0  9  0  1  1  0
##   6  1  0  0  0  0  1  7  2  2  2
##   7  3  1  6  1  0  0  2  4  1  1
##   8  0  0  0  2  3  1  3  1  8  1
##   9  0  1  0  0  0  0  1  1  0 16

Do some of the most confused classes correspond to what you would expect from the example images?
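
If you need a reminder of which flower each numeric code in the table refers to, one quick sketch is to line the codes up with the fnames vector built above:

# class codes 0 through 9 paired with their flower names
tibble(class = 0:9, class_name = fnames)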

HSV Space

One difficulty with using the red, green, and blue pixel values is that they do not map very well onto a “natural” notion of color. They are useful for telling digital screens how to display images, but not ideal for much else.

Instead, we can use a different color space model that translates red, green, and blue into a different set of variables. One popular choice in statistical learning is the hue, saturation, value (HSV) space. These three values each range from 0 to 1. A good picture helps a lot in understanding what the values mean:

Value tells how dark a pixel is, saturation tells how much color it has (a low saturation is close to grey), and hue gives the specific point on a color wheel. Usually a hue of 0 indicates red. Notice that hue is a circular variable, so a hue of 0.99 is close to a hue of 0.01.
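
As a small illustration of this circularity, the base R function hsv (which we use again below) converts HSV coordinates into a hex color; hues just above 0 and just below 1 should both come out as near-identical reds, while a hue of 0.5 is cyan:

# hues of 0.01 and 0.99 are both almost pure red; 0.5 is cyan
hsv(c(0.01, 0.99, 0.5), 1, 1)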

We can convert into HSV space with the base R function rgb2hsv:

i <- 3
red <- as.numeric(x64[i,,,1])
green <- as.numeric(x64[i,,,2])
blue <- as.numeric(x64[i,,,3])
hsv <- t(rgb2hsv(red, green, blue, maxColorValue = 1))
head(hsv)
##              h         s         v
## [1,] 0.5681818 0.3235306 0.2666657
## [2,] 0.5681818 0.2404577 0.3587929
## [3,] 0.5733333 0.2531846 0.3872243
## [4,] 0.5733333 0.2492254 0.3933757
## [5,] 0.5802469 0.2573864 0.4113750
## [6,] 0.5808081 0.3063756 0.4223958

To make sure we understand exactly what these values mean, let’s plot some values in R. The hsv function maps a set of HSV coordinates into the name of a color. Here we look at a set of grey colors with varying values (hue is set to 1 and saturation to zero), as well as a set of 10 hues where saturation and value are both set to 1.

color_vals <- c(hsv(1, 0, seq(0, 1, by = 0.2)),
                hsv(seq(0, 0.9, by = 0.1), 1, 1))
df_cv <- tibble(color_vals = color_vals, num = seq_along(color_vals))

df_cv %>%
  ggplot(aes(num, num)) +
    geom_point(aes(fill = color_vals), color = "black", size = 12, shape = 21) +
    scale_fill_identity() +
    theme_void()

A good trick with HSV space is to place each pixel into one of a small set of fixed colors. We will start by using the buckets defined in the previous plot.

We will do that by creating a vector called color, initialized to “#000000” (pure black), and then changing each entry depending on its HSV coordinates. If the saturation is less than 0.2, the pixel is too washed out to assign a meaningful hue, so we set it to a shade of grey depending on its value, rounded to the nearest fifth (the six grey levels from the previous plot). If the saturation is above 0.2 and the value is above 0.2 (i.e., the pixel is not too dark), we bucket the hue into ten bins. Pixels with high saturation but a low value are kept at the default of black.

color <- rep("#000000", nrow(hsv))

index <- which(hsv[,2] < 0.2)
color[index] <- hsv(1, 0, round(hsv[index,3] * 5) / 5)

index <- which(hsv[,2] > 0.2 & hsv[,3] > 0.2)
color[index] <- hsv(round(hsv[index,1],1), 1, 1)

table(factor(color, levels = color_vals))
## 
## #000000 #333333 #666666 #999999 #CCCCCC #FFFFFF #FF0000 #FF9900 #CCFF00 
##     124      32     160      13     128       7      13    2870      83 
## #33FF00 #00FF66 #00FFFF #0066FF #3300FF #CC00FF #FF0099 
##      28       3      22     611       2       0       0

For the one test image, we see that the dominant color is “#FF9900”, an orange, followed by “#0066FF”, a blue.

We can use these counts as features to tell us about a given flower. Let’s cycle over the entire dataset and grab these features.

X_hsv <- matrix(0, ncol = length(color_vals),
                   nrow = nrow(flowers))
for (i in seq_len(nrow(flowers))) {
  red <- as.numeric(x64[i,,,1])
  green <- as.numeric(x64[i,,,2])
  blue <- as.numeric(x64[i,,,3])
  hsv <- t(rgb2hsv(red, green, blue, maxColorValue = 1))

  color <- rep("#000000", nrow(hsv))

  index <- which(hsv[,2] < 0.2)
  color[index] <- hsv(1, 0, round(hsv[index,3] * 5) / 5)

  index <- which(hsv[,2] > 0.2 & hsv[,3] > 0.2)
  color[index] <- hsv(round(hsv[index,1],1), 1, 1)

  X_hsv[i,] <- table(factor(color, levels = color_vals))
}
head(X_hsv)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## [1,]  435  191  283  207  280  175  100  579  752   105     0     0     0
## [2,]    0  370  731  135    7   19  957  611   42   165   972    51    11
## [3,]  124   32  160   13  128    7   13 2870   83    28     3    22   611
## [4,]   94   42  299  552  223  117   12  245 1754    14     0     0     1
## [5,] 2825  232   32   41  245  144    2  188  214     8     1     0     4
## [6,]  415  114  222  193  180  434  135  121  305   903   259    73   152
##      [,14] [,15] [,16]
## [1,]    35   832   122
## [2,]     5     1    19
## [3,]     2     0     0
## [4,]   404   335     4
## [5,]    25    88    47
## [6,]   486    36    68

The 8th column is the orange color and the 9th a greenish color, both of which appear frequently thanks to the flowers themselves and the background greenery.
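
To make the columns of X_hsv easier to read, one purely cosmetic option is to attach the hex codes as column names:

# label each count column with the hex code of the color bucket it counts
colnames(X_hsv) <- color_vals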

We can use this new matrix to fit another elastic net. The matrix is small enough that we could use other techniques too, but I’ll keep it consistent here.

y <- flowers$class

X_train <- X_hsv[flowers$train_id == "train",]
X_valid <- X_hsv[flowers$train_id == "valid",]
y_train <- y[flowers$train_id == "train"]
y_valid <- y[flowers$train_id == "valid"]

library(glmnet)
model <- cv.glmnet(X_train, y_train, family = "multinomial")
beta <- coef(model, s = model$lambda.1se)
beta <- Reduce(cbind, beta)
colnames(beta) <- fnames
rownames(beta) <- c("(intercept)", color_vals)
as.matrix(beta)
##                  daffodil      snowdrop   lily valley      bluebell
## (intercept)  0.5227351876 -2.663736e-01  1.380739e+00  1.715815e-01
## #000000      0.0000000000  2.738252e-04  9.556192e-05 -1.043476e-04
## #333333      0.0010117769  2.503069e-04 -1.676635e-03 -1.467822e-03
## #666666      0.0000000000  0.000000e+00  6.796344e-06  0.000000e+00
## #999999      0.0000000000  1.007637e-03 -2.798258e-04  0.000000e+00
## #CCCCCC     -0.0033169647  1.413444e-03  0.000000e+00  0.000000e+00
## #FFFFFF     -0.0014928966  9.810528e-04  0.000000e+00  0.000000e+00
## #FF0000     -0.0005787812  0.000000e+00  0.000000e+00  0.000000e+00
## #FF9900      0.0004925621 -5.694848e-05 -1.931152e-03 -9.271120e-04
## #CCFF00      0.0000000000 -5.829491e-05  0.000000e+00  0.000000e+00
## #33FF00     -0.0001178432  0.000000e+00  2.810639e-04  0.000000e+00
## #00FF66      0.0000000000  0.000000e+00  0.000000e+00  2.660617e-05
## #00FFFF      0.0005923406  4.864185e-04  1.375141e-04  0.000000e+00
## #0066FF      0.0001660562  0.000000e+00 -8.963539e-04  0.000000e+00
## #3300FF      0.0000000000  0.000000e+00 -1.860959e-04  3.939008e-03
## #CC00FF      0.0000000000  0.000000e+00  0.000000e+00  2.465911e-03
## #FF0099      0.0000000000  0.000000e+00  0.000000e+00  0.000000e+00
##                    crocus          iris     tigerlily         tulip
## (intercept)  3.469268e-01 -5.963652e-01 -0.3431818995  8.379175e-01
## #000000      0.000000e+00 -1.717496e-04  0.0000000000  1.393792e-04
## #333333      0.000000e+00 -3.790566e-03  0.0000000000  0.000000e+00
## #666666      0.000000e+00 -1.487313e-03 -0.0001587744  4.734441e-04
## #999999      5.578828e-04 -8.695910e-04 -0.0004614850  0.000000e+00
## #CCCCCC      1.276358e-03  2.366705e-03  0.0000000000 -1.102783e-03
## #FFFFFF      2.106243e-04 -8.670584e-05  0.0001790127 -2.598811e-03
## #FF0000      0.000000e+00  2.077320e-03  0.0030597873 -6.851298e-04
## #FF9900      1.923474e-04  0.000000e+00  0.0000000000  6.424913e-05
## #CCFF00     -7.431796e-04  4.887551e-04  0.0000000000  0.000000e+00
## #33FF00     -6.524823e-04  0.000000e+00  0.0003232092  0.000000e+00
## #00FF66     -1.246369e-03  0.000000e+00  0.0000000000  0.000000e+00
## #00FFFF     -1.181340e-03  0.000000e+00  0.0000000000 -4.141706e-04
## #0066FF     -4.519156e-05  0.000000e+00  0.0006873417 -2.217240e-04
## #3300FF      2.427662e-03  3.054449e-03  0.0000000000  0.000000e+00
## #CC00FF      3.683836e-03  2.497049e-03  0.0000000000  0.000000e+00
## #FF0099      0.000000e+00  0.000000e+00 -0.0003807002  0.000000e+00
##                fritillary     sunflower
## (intercept) -7.374753e-01 -1.3165046069
## #000000      0.000000e+00 -0.0001802347
## #333333      0.000000e+00  0.0000000000
## #666666      0.000000e+00  0.0000000000
## #999999      0.000000e+00  0.0000000000
## #CCCCCC      0.000000e+00  0.0000000000
## #FFFFFF      0.000000e+00  0.0000000000
## #FF0000      1.456720e-03  0.0000000000
## #FF9900     -2.502337e-04  0.0014845994
## #CCFF00      0.000000e+00  0.0000000000
## #33FF00      5.404933e-05  0.0000000000
## #00FF66      0.000000e+00  0.0000000000
## #00FFFF     -6.556300e-04  0.0009658046
## #0066FF      0.000000e+00  0.0012999982
## #3300FF      0.000000e+00  0.0000000000
## #CC00FF      0.000000e+00  0.0000000000
## #FF0099      9.430848e-03  0.0000000000

We see some nice patterns here. Snowdrops have a large coefficient for light grey (“#CCCCCC” and “#999999”) and white (“#FFFFFF”), and bluebells have large values for blue (“#3300FF”) and purple (“#CC00FF”). Sunflowers have a large coefficient for orange (“#FF9900”).

This model is slightly more predictive and not nearly as overfit.

pred <- as.numeric(predict(model, newx = X_hsv,
                           type = "class"))

tapply(pred == y, flowers$train_id, mean)
##     train     valid 
## 0.5383333 0.4350000

The reason for this is that the first elastic net likely approximated the kind of analysis we did here, but in doing so it overfit to the way hue, saturation, and value looked in the training data.

More Colors

We can improve our model by including more colors. We don’t need any more greys, but let’s include a set of 100 hues. This will give us more information about the particular colors of each flower.

color_vals <- c(hsv(1, 0, seq(0, 1, by = 0.2)),
                hsv(seq(0, 0.99, by = 0.01), 1, 1))

X_hsv <- matrix(0, ncol = length(color_vals),
                   nrow = nrow(flowers))
for (i in seq_len(nrow(flowers))) {
  red <- as.numeric(x64[i,,,1])
  green <- as.numeric(x64[i,,,2])
  blue <- as.numeric(x64[i,,,3])
  hsv <- t(rgb2hsv(red, green, blue, maxColorValue = 1))

  color <- rep("#000000", nrow(hsv))

  index <- which(hsv[,2] < 0.2)
  color[index] <- hsv(1, 0, round(hsv[index,3] * 5) / 5)

  index <- which(hsv[,2] > 0.2 & hsv[,3] > 0.2)
  color[index] <- hsv(round(hsv[index,1], 2), 1, 1)

  X_hsv[i,] <- table(factor(color, levels = color_vals))
}

We will use the elastic net again here. With the increased set of colors, let’s set alpha to 0.2 to spread the weights out over similar colors.

y <- flowers$class

X_train <- X_hsv[flowers$train_id == "train",]
X_valid <- X_hsv[flowers$train_id == "valid",]
y_train <- y[flowers$train_id == "train"]
y_valid <- y[flowers$train_id == "valid"]

library(glmnet)
model <- cv.glmnet(X_train, y_train, family = "multinomial",
                   alpha = 0.2)

With the more granular color labels, the model fits the training data somewhat better, though the validation accuracy is essentially unchanged:

pred <- as.numeric(predict(model, newx = X_hsv, type = "class"))
tapply(pred == y, flowers$train_id, mean)
##     train     valid 
## 0.5833333 0.4350000

We can create an interesting visualization of these values by showing the weights as a function of the actual colors for each flower.

beta <- coef(model, s = model$lambda.min)
beta <- as.matrix(Reduce(cbind, beta))[-1,]
colnames(beta) <- fnames
rownames(beta) <- color_vals
df <- tibble(flower = rep(colnames(beta), each = nrow(beta)),
             color = rep(rownames(beta), ncol(beta)),
             beta = as.numeric(beta))
cols <- color_vals; names(cols) <- color_vals
df$color <- factor(df$color, levels = color_vals)
filter(df, beta > 0) %>%
  ggplot(aes(color, flower)) +
    geom_point(aes(color = color, size = beta), show.legend = FALSE) +
    theme_minimal() +
    scale_colour_manual(values = cols) +
    theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank(),
        panel.border = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.line = element_line(colour = "black"))

Tiger lilies are very red, whereas bluebells, crocuses, and irises have a blue/purple color.

The notes here show how we can apply the techniques we have seen in class to the raw pixel intensities in the flowers dataset. This works because flowers can largely be described through color and the images themselves are fairly small. Note, though, that we did a better job by hand-constructing features that capture exactly what we want; in this case, the counts of different hues and greys. We know the flowers can be distinguished reasonably well with this kind of technique because each type has a relatively distinctive color. What is different from text, though, is that while word counts work well as at least a good first pass for almost all text processing tasks, there is no uniform set of features that works for all image processing tasks. We’ll return to this idea in the second set of notes.