Class 01: Introduction to Statistical Learning
Welcome! All of the material for this course, including the notes here, are available on our class website:
The website has a detailed syllabus; today I will just cover some of the most important aspects of it. At the end of class today, I’ll finish by explaining how to set up R and GitHub for the next class.
What Is “Learning”?
Statistical learning, synonymous with machine learning, is the process of extracting knowledge from data automatically, usually with the goal of making predictions on new, unseen data.
A classical example is a spam filter, for which the user keeps labelling incoming mails as either spam or not spam. A machine learning algorithm then “learns” a predictive model from data that distinguishes spam from normal emails, a model which can predict for new emails whether they are spam or not.
Here are some explicit examples that we will see:
- using physical characteristics of animals to predict whether they are carnivores
- estimate how much a house is worth given properties such as number of bedrooms, square footage and its address
- predict whether a flight will be delayed given the carrier, scheduled departure time, arrival and departure airports
- a crime has been reported at a specific place and time in Chicago; what type of crime is it?
- here is a picture of a flower, what kind of flower is it?
- given two sentences of text, predict which President used it in a public speech
- how many page views will a specific Wikipedia page receive tomorrow?
And can look at some explicit examples of these models in the wild:
This class is not about teaching you a wide range of models. A common theme, in fact, is that building good models depends far more on carefully cleaning and featurizing the data than picking a complex model.
In my experience, most learning algorithms fall into one of two broad categories:
- nearest neighbors (local): estimate values of new points by finding previously observed points close to the new ones
- linear models (global): estimate weights for each parameter; classify new points by summing up these weights
Within these classes, I typically find the need to use only some combination of the following four algorithms:
- k-nearest neighbors: a straightforward application of nearest neighbors
- gradient boosted trees: adaptively implement nearest neighbors by determining which directions “matter”
- elastic net: a linear model with controls on the sizes of the weights
- neural networks: iteratively apply collections of elastic nets to learn a hierarchy of increasingly complex weights
If some of these concepts seem hazy at the moment, that is perfectly natural. We’ll go into much more detail throughout the semester.
I spent several years after finishing graduate school working in industry in a number of positions. All of these consisted primarily of collecting and understanding large datasets and building predictive models from them.
My goal this semester is give you the skills to do this as well, at least with medium-sized datasets. You can then extend these skills to specific problem domains or learn how to make them work in real-time or at a large scale.
Because learning to work with data is the most important aspect of applying statistical learning, we will be working with a new dataset nearly every class period. Concepts and new algorithms will be intermixed with these, but the focus will generally be how to apply them to a specific task.
This class is cross-listed with mathematics and computer science. We have a fairly diverse set of academic backgrounds in the room. You will likely encounter parts of some classes which are a bit over your head. That is not just okay but, in my mind, ideal. Try your best to catch the basic ideas and don’t get worried about the details. At the same time, you may find the first few weeks a bit slow as everyone works to get up to the same level.
As mentioned, this is just a summary of the larger document located on the course website.
This course will be almost entirely code-centric. We will be using R throughout the semester. All of my notes (including this document) will be given as R markdown notebooks. These are self-contained files that include code, output, text, and graphics. All of your assignments will also be handed in as R markdown notebooks. No prior experience with R will be assumed. The course is designed to require no formal prior programming experience, but you should ideally have some familiarity with a scripting language such as R, Python, Matlab, or Mathematica.
Every course (other than this first one) through Thanksgiving will
have an associated file named
lab00.R, with the appropriate
class number replaced for the
By noon before the start of the next class, you must complete the
questions contained within the lab notebook. Assignments will be
submitted through GitHub; this process is explained more during
our next class.
Each lab will generally consist of looking at a new dataset with some missing responses. You need to build, documenting all of your steps and justify why you are doing them, a predictive model and fill in the missing values.
You will receive an auto-graded score by the start of class showing how well you performed on the estimation task.
While getting good prediction rates is important, your grades on labs will be more holistically based on your general approach and overall write-up. You will receive detailed feedback and a letter grade three times throughout the semester (roughly every 4th week) summarizing your performance on the labs.
You will also complete a course project. The last two weeks of the semester will consist of presentations of these projects. I plan to assign these just after Fall Break.
Final grades will be determined by the following weighting scheme:
- labs (66%)
- final project (34%)
The actual syllabus file has more details concerning the specifics of how to convert between numeric and letter grades.
because someone always asks
There is no required textbook and there are no exams, final or otherwise.
also, because someone always asks
We will have one class outside, probably around week 7. Details to follow.
Each class will generally be broken into three sections:
- first: walking through new concepts by applying techniques to a new dataset using my notes
- second: pairing or tripling off to share your results from the most recent lab
- third: share results from the recent lab with the entire class (time remaining)
Depending on the task, I may ask you to present the work of your partner rather than your own. Typically, the lab due for the next class will use the data from the prior classes notes.
Because of the class structure, ideally you should bring your laptop or at least a tablet to each class meeting. If this is a problem, I suggest pairing up with someone who can bring a computer to class. If this is also not feasible, please let me know as soon as possible.
Set-up for next time
For next time, you need to get GitHub, R and RStudio set-up. For GitHub, if you do not already have an account, go to the main webpage and create a free account. Try to pick a professional name as you may find that you want to make this public at some point:
Once you have created your account, add some basic information to your profile. At a minimum, include a photo and your preferred name.
We will be using the R programming language for this course. Please follow the instructions on the following website to download and install R on your machine.
- Windows: https://cloud.r-project.org/bin/windows/base/
- Mac: https://cloud.r-project.org/bin/macosx/R-3.4.1.pkg
Next, download RStudio. This is a nice GUI (I’m using it right now) that improves the user experience of using R. Download it here:
- Windows: https://download1.rstudio.org/RStudio-1.0.153.exe
- Mac: https://download1.rstudio.org/RStudio-1.0.153.dmg
Next class, I will explain more about how to use these and get them set up for your first lab assignment.
For next class
To recap, you have to do three things before our next class:
- create a GitHub account; add a photo and preferred name
- download and install R
- download and install RStudio
If you run into any problems with this set-up, please e-mail me or drop by my office. We can debug after class on Thursday if need be. You will NEED this to work for the assignment over the weekend.
As a last task, please complete the brief questionnaires I am handing out.
In case you’re curious, here are my answers:
- Taylor Arnold (Taylor or Dr. Arnold; please don’t mix them up though!)
- technically I never took linear algebra, but I do know it very well :)
- predicting stuff is super fun!
- text and image analysis with applications to linguistics and media studies