# Logistic regression
## Bird dataset
To briefly switch things up, I want to look at a new dataset giving details
of nearly 2000 bird species.
```{r}
set.seed(1)
x <- read.csv("../../data/avian_ssd_jan07.txt", as.is=TRUE, sep="\t")
x <- x[x$Clutch_size > 0,]
x <- x[x$M_mass > 0,]
x <- x[x$Egg_mass > 0,]
dim(x)
```
As a sanity check, do the birds with the largest egg mass make sense to us?
```{r}
x$English_name[order(x$Egg_mass,decreasing=TRUE)[1:10]]
```
![Emu](img/emu.jpg)
![Wandering Albatross](img/alb.jpg)
![Royal Albatross](img/ralb.jpg)
![Brown Kiwi](img/bkiwi.jpg)
![Emperor Penguin](img/penguin.jpg)
I am interested in understanding what features make a bird have only one
egg in a clutch. We can see that the mass (male) and mass of the egg both
seem to influnce this set:
```{r}
cl <- as.numeric(x$Clutch_size == 1)
y <- cl*2 - 1
plot(x$M_mass[cl == 0], x$Egg_mass[cl == 0], log="xy",
pch=19, cex=0.5, col=grey(0.5,0.5))
points(x$M_mass[cl == 1], x$Egg_mass[cl == 1], col="blue", pch=22)
```
We can use these two variables to fit both a linear and logistic
model for classification:
```{r}
outLm <- lm(y ~ log(M_mass) + log(Egg_mass), data=x)
summary(outLm)
outGlm <- glm(cl ~ log(M_mass) + log(Egg_mass), data=x, family="binomial")
summary(outGlm)
```
Which we can visualize as the following for the linear model:
```{r}
plot(log(x$M_mass[cl == 0]), log(x$Egg_mass[cl == 0]),
pch=19, cex=0.5, col=grey(0.5,0.5))
points(log(x$M_mass[cl == 1]), log(x$Egg_mass[cl == 1]), col="blue", pch=22)
abline(-1 * outLm$coef[1] / outLm$coef[3], -1 * outLm$coef[2] / outLm$coef[3],
col="#6E3179", lty="dashed", lwd=1.5)
```
But wait, this doesn't seem very helpful, right? The problem is that
the classes are unbalanced so we need to actually *move* the plane by some
amount:
```{r}
plot(log(x$M_mass[cl == 0]), log(x$Egg_mass[cl == 0]),
pch=19, cex=0.5, col=grey(0.5,0.5))
points(log(x$M_mass[cl == 1]), log(x$Egg_mass[cl == 1]), col="blue", pch=22)
abline(-1 * outLm$coef[1] / outLm$coef[3] - 2, -1 * outLm$coef[2] / outLm$coef[3],
col="#6E3179", lty="dashed", lwd=1.5)
```
And the GLM looks like this:
```{r}
plot(log(x$M_mass[cl == 0]), log(x$Egg_mass[cl == 0]),
pch=19, cex=0.5, col=grey(0.5,0.5))
points(log(x$M_mass[cl == 1]), log(x$Egg_mass[cl == 1]), col="blue", pch=22)
abline(-1 * outLm$coef[1] / outLm$coef[3] - 2, -1 * outLm$coef[2] / outLm$coef[3],
col="#000000", lty="dashed", lwd=1.5)
abline(-1 * outGlm$coef[1] / outGlm$coef[3] - 0.7, -1 * outGlm$coef[2] / outGlm$coef[3],
col="#6E3179", lty="dashed", lwd=1.5)
```
Which both visually separate the space in a sensible way.