class: center, middle, inverse, title-slide # Introduction to MATH389 ### Taylor Arnold ### 2021-01-18 --- # Welcome! Welcome to Statistical Learning! Today we are going to go through a few notes, which give a long-form version of the posted syllabus and an outline of the course. We will then walk through setting up the software we will be using throughout the semester. Feel free to ask questions at any point during the semester. Please either (i) use the Zoom raise hand feature, (ii) post your question or just a question mark in the chat window, or (iii) wait for a pause and shout out directly. --- # Website As reminder from my emails to the class, all of the materials for the course can be found on the course website: - https://statsmaths.github.io/stat389-s21/index.html Make sure to bookmark this somewhere easy to access. The page will remain online indefinitely for your reference following the semester. --- # Who am I? - Taylor Arnold - Ph.D. in Statistics - Faculty in Mathematics, Computer Science, and Linguistics - Research applies computational techniques to explore large text and image datasets. I work closely with scholars in linguistics and media studies. - Lots of industry experience in data anaysis across different fields: - health care - DARPA / DoD - insurance - telecom (AT&T Labs Research) --- # What is statistical learning? - Machine/Statistical Learning (ML) is a branch of artificial intelligence - Uses data to create models that make predictions or detect patterns - Methods mostly fall into two groups: - **Unsupervised Learning** finds patterns and structures in data w/o a specific goal - **Supervised Learning** creates models that learn to make predictions on new data - Examples of tasks: - Predict whether an email message should be put into a user's spam box. - Predict the sale price of a house based on its size and location. - Find and identify all the faces found in a video feed. - Cluster a collection of news paper articles by themes. - Flag suspicious product reviews that should be manually investigated for fraud. --- # Teaching ML - **Mathematical Approach**: Focus on theoretical properties of various methods, using the language of probability and numerical analysis. - **Statistical/ Data Science Approach**: Focus on the application of ML techniques in order to understand complex datasets. - **CS/Engineering Approach**: Focus on implementation and performance of ML techniques and algorithms. --- # Our approach - Mathematical Approach: Focus on theoretical properties of various methods, using the language of probability and numerical analysis. - **Statistical/Data Science Approach: Focus on the application of ML techniques in order to understand complex datasets.** - CS/Engineering Approach: Focus on implementation and performance of ML techniques and algorithms. While we will cover techniques that are (mostly) applicable to a wide range of applications, this semester will focus primarily on applications to large collections of textual data. --- # What will you learn from this course? - understand the terminology of predictive and unsupervised ML methods - how to write, run, and document data-drive code in the open-source R programming language - how to use and understand a core set of general-purpose, interpretable ML methods - how to use and understand several specific methods for working with textual data - how to summarize and present the results of an exploratory analysis of data that integrates ML methods --- # What won't you (directly) learn? - a laundry-list with dozens of ML methods - theoretical justification/analysis of ML methods - implementation details of ML methods - deep learning models I have, however, tried to link to additional resources throughout the notes for students interested in theoretical or implementation details. --- # Things I expect from you - regular attendance in class, arriving on time and being present for the full class period (more than 2 absences may effect your final grade) - make progress in class coding activities - before next week, read through the "Using R to Manipulate and Visualize Data" review guide posted on the course website - complete and present five course projects - complete a two-page, end-of-semester, self assessment of what you have learned --- # Class Groups - Most of the work you complete for this course will be done in (fixed) groups of 3-4 students. - Despite some challenges, I have found it to be the best solution to making remote learning resemble in-person learning. - There will be plenty of time to work together in break-out groups, and all of the other work should be possible to do asychronously. - Projects and course work will be completed as a group. Your self-assessment will be submitted individually. - I prefer that you all try to organize yourselves into groups. - Send me, by email, your group preferences by the end of the week. - I think groups of 3 are ideal, but 4 is also possible. - If you want to pair up, I will combine pairs into groups of 3 or 4. - I except groups to be remain fixed throughout the semester, but we can discuss changes if issues arise. --- # Grades - All work will be uploaded to a Google Drive folder shared with your class group. - You will not receive a specific grade for classwork (called a "lab" in my notes); just remember to upload whatever you have finished at the end of class. - Projects will be graded as either "satisfactory" or "unsatisfactory" - All students who have earned satisfactory grades on the five projects, and missed no more than two classes, will earn a grade of at least a B+. - Grades of B+, A-, A, and A+ will be given according to each student's self-assessment and consideration of their projects and course participation. --- # Class Outline - First four weeks: - daily notes outline key terms and methods in ML - followed by a "lab", programming questions that your group will work though using a shared screen - time catch up on the basic mechanics of working with data in R - Remaining ten weeks: - focused on working on the five projects (two weeks each) - will include 1-2 new methods relevant to each project; these have short "labs" as well - time to work in class as a group - give a 10-minute presentation to the class - Project topics: - IMDb movie reviews: predicting how many stars a movie review gives - Amazon product reviews: predict the author of a review - Yelp reviews: predict author of the review and cluster the corpus authors - Wikipedia: detect themes in a subset of Wikipedia articles - News: combine classification and theme detection in a time-series dataset --- # Office Hours - Office hours are held using the same Zoom link as used for class. - Given social distancing, I find fixed office hours to not be very useful. - I make a habit of staying after class as long as students have questions. - If you want to meet right before class, just let me know and I will log in early. - I will have office hours at 7pm the night before each project is due. - If you have extended questions, personal concerns, or are unable to meet on either side of the class meeeting, please feel free to email me or ask in class for a dedicated appointment. - You can also always email questions. - I usually respond within one work day; if not, feel free to re-send. - It's easier for me if you send the text of any code errors you have (paste or attach the code), rather than a screen shot. --- # Frequently Asked Questions - **Is there a textbook for the course?** No, we will be using my own notes and references to other freely available resources. - **Are there any exams?** No. - **Is there a final exam?** No. - **Can I switch sections?** Not this semester, due to our use of fixed groups. - **Can I take the course asynchronously?** Not this semester, due to our use of fixed groups. - **When are the projects due?** Tentative dates are given on the course website.