class: center, middle, inverse, title-slide # Introduction to Data Science ### Taylor Arnold ### 2020-08-25 --- class: center, inverse, middle, title-slide # Part 1: Welcome! --- # Welcome! Welcome to Introduction to Data Science! It is certainly shaping up to be a strange year. I am looking forward to the semester and hoping to find ways to get to know you all remotely and to make this course as fun and rewarding an experience as it usually is. -- Today we are going to go through these notes, which give a long-form version of the posted syllabus. You should feel free to ask questions at any point during the semester. Please either (i) use the Zoom raise hand feature, (ii) post your question or just a question mark in the chat window, or (iii) wait for a pause and shout out directly. --- # Website We will get to all of the course details (grades, assignments, expectations, and so on) soon. Before that, as reminder from my emails to the class, all of the materials for the course can be found on the course website: - https://statsmaths.github.io/stat289-f20/ Make sure to bookmark this somewhere easy to access. The page will remain online indefinitely for your reference following the semester. --- class: center, inverse, middle, title-slide # Part 2: Introductions --- # Questions 1. Name 2. Favorite food or cuisine 3. Favorite place or city 4. Field(s) of interest --- class: center, inverse, middle, title-slide # Part 3: Course Content --- # Description Data science is an interdisciplinary field concerned with drawing knowledge from data and communicating those results to various audiences. Data science needs to be learned *by doing* data science. -- This course broadly covers the entire process of collecting, cleaning, visualising, modelling, and presenting information for data. It has a MATH designation but is not a mathematics course. The focus is on applied statistics and data analysis rather than a detailed study of symbolic mathematics. -- By the end of the semester you will feel confident collecting, analysing, and writing about datasets from a variety of fields. You will be able to use these skills to address data-driven problems in a wide range of application domains. --- # Example analysis: Hans Roslin (2010) As an example of the kinds of analysis that we want to make possible, let's watch this short video presentation by Hans Roslin showing an analysis of 400 years of macroeconomic data: - https://www.youtube.com/watch?v=jbkSRLYSojo Do pay attention, as I will use this video as a reference throughout the next few weeks. --- # Outline The course content is broken roughly into two parts. In the first we learn so core techniques for working with data. These notes take about 6 weeks of the semester to cover, focusing on four main tasks: 1. Creating and organising data using a tabular data model. 2. Learning and applying a theory for data visualisation (Grammar of Graphics). 3. Learning and applying a theory for data manipulation (data table verbs). 4. Learning and applying a method for storytelling with data. And in the second part, we will practice applying these core methods to specific tasks. --- # Prerequisites and MATH209 The pace of this course assumes that students have had some prior exposure to a programming language and have taken a course in which statistical techniques were applied to the analysis of real-world datasets. However, **this is not a strict prerequisite**! Just be prepared to put in some extra effort earlier on in the semester. We receive a lot of questions about the relationship between MATH209 and MATH289. The first focuses a bit more on statistical inference, whereas the second is more oriented towards building a toolkit of computational models for working with data. The exact material covered in either depends as much as on who the instructor is as it does the course number. --- class: center, inverse, middle, title-slide # Part 4: Format --- # Online Format Due to social distancing guidelines and the need to maximize flexibility, Data Science is being taught using online instruction. Classes will be presented remotely over Zoom. A link to the Zoom room is provided on the course website. A numeric password was sent via email to the class roster. The course has been designed so that only minimal changes will be needed in the event that some or all students need to return home before the normal end of the semester. --- # Course Format Most class meetings will consist of working through a set of course notes and questions in the form of an *Rmarkdown* file (more on this later). This file will introduce new concepts and then ask us to apply these concepts to various datasets. Everyone in the class will have their own version of the file, and you are expected to follow along with the pace of the course. -- We may not finish all of the questions during class. In this case, it is recommended to finish the material as homework. However, there will be no formal handing in of these assignments. -- During the course I will call on students to share their work. There is no expectation for you to share your video, but you should be prepared to share your screen at any time with the class. -- Students are not allowed to record the text or video of the class without prior permission. When permission is granted, recordings must not be shared and shall be destroyed at the end of the semester. --- # Zoom policies Please follow the following policies when joining the group office hours and synchronous course meetings: - Use your UR Zoom login account. - Include a profile photo (but does not to literally be a "profile photo"). - Camera is optional. Feel free to turn on and off as needed. - Be prepared to share screen if called on. - Chat window is the easiest way to indicate that you have a question (either post or just type "?"). - Welcome to "raise hand" or shout out your question if that seems easiest. If you have any concerns about any of these items, please let me know. --- # Grades There will be several class projects assigned throughout the semester. Each project will be distributed with a rubric and assigned a total number of overall points. Projects will be submitted through Box. At the end of the semester, the overall project grade will be determined by adding together the total number or earned points and dividing by the total number of available points. The final grade will be computed by the following formula: - **Projects** 70% - **Participation** 30% --- # Attendance Policies For students electing the standard grading scheme, satisfactory participation includes: - attending most course meetings - arriving on time for class - actively following along with the day's materials - raising questions (in the Zoom chat or verbally) when stuck - being prepared to share your screen when called on Students will be called on in a random order to share their progress through the day's assignment. Participation grades are entirely focused on your *attempt* to engage with and follow the material; this not an intended to penalise anyone during the learning process. A full grading rubric for attendance will be posted during the second week of the semester. --- # Special Circumstances I completely understand that extenuating circumstances may arise throughout the semester that effect your ability to attend class or follow along with the pace of the course. Please contact me as soon as possible so that we can find the best path forward. --- # Alternative Grading Some students may be unable to travel to campus due to the ongoing COVID-19 pandemic. It is the University's policy that all enrolled students should be prepared to attend courses remotely from wherever they are; in reality this may be difficult due to shifted time-zones and situations at home. As an alternative, in this course students will be given the option to follow the material asychronously. In lieu of the participation grade, students following this option will be scheduled for 1-2 oral examinations during the semester. The grading weighting becomes, instead: - **Projects** 70% - **Oral Examination** 30% In the interest of equity, this option is open to all students, but must be chosen during the first week of the semester. --- # Alternative Grading (Oral Exam) The term *oral exam* sounds more intense that it will be in practice. I will simply ask students taking this option to walk through and explain a few of their solutions to selected in-class assignments. Full marks will be earned as long as you generally seem to understand the core ideas of course. --- # Office Hours I will hold open office hours during the semester via Zoom. This will likely be held in the evening to accommodate students from alternative timezones and to avoid class conflicts. Office hours are, of course, entirely optional. If you choose to attend, please arrive at the start of the listed session so that there is enough time for everyone to get their questions answered. You may leave the session at any time. Office hour times will be announced after the results of the student poll that will be given at the end of today's class. If you have questions of a personal nature (that is, not about the course material), please reach out via email with your question or a request to meet. --- # Pace of the Course Students in this course have a wide variety of backgrounds in statistics, mathematics, and computer science. With this in mind, it is important to consider a few things about the pace of the course. First of all, my notes assume that you have no background in statistical programming. Therefore: - If this material is new to you, take advantage of the first few weeks to catch up. Ask questions, attend office hours, review the notes in-between course sections. - If the material seems mostly like review, do not incorrectly assume the entire semester will be equally as easy for you. Be prepared for the pace to speed up as we reach material that is new to you. --- # Pace of the Course, cont. Secondly, if you are already familiar with programming in R, it is likely that I will be teaching you different ways to do things than you have previously learned. - I expect everyone to learn the specific methods presented in this course. - Pay close attention to the packages and functions I am using. They are designed to work together in a very particular way. - It is likely that alternative methods will appear to be fine the first time we encounter something (making a scatterplot, filtering data, ect.), but will break down in more advanced applications. --- # Frequently Asked Questions - **Is there a textbook for the course?** No, we will be using my own notes and references to other freely available resources. - **Is there a final exam?** No. - **When are the projects due?** The last project will be due during the last week of class. Dates and instructions for other projects depend on the pace of the course. - **What software will we be using?** We will be writing code using the R programming language using a new platform called RStudio Cloud. More on this next time. Classes will be held using Zoom and projects will be submitted using Box. --- class: center, inverse, middle, title-slide # Part 5: Survey --- # Me! - From New England: born in Maine, school in MA, ME, CT - Research on large text and image datasets in linguistics and cultural studies - Came to Richmond in 2016 - Last year in Lyon, France - Some previous work: - IBM (Healthcare) - Travelers (Insurance) - DARPA (social media) - AT&T (location analytics) - Own a Shih-Tzu named Sargent (you will likely see and hear him at some point) --- # You? It would be great to get to know all of you now as well. Please fill out the follow form (a link is on the main website as well): - https://forms.gle/gdVEoVAEoDLQCaEP7 We will be collecting additional data about ourselves throughout the semester as well, but this is a good way to get started.