Data Science for Everyone: course text

By Arthur Spirling and Andrea Jones-Rooy.

We developed this collection of teaching notes for the New York University (NYU) undergraduate data science course, DS-UA 111: Data Science for Everyone (DS4E), offered by the NYU Center for Data Science.

DS4E is the first in the sequence for the Data Science major and minor, as well as the joint majors in Data Science & Computer Science and Data Science & Mathematics at NYU. This course may also be taken as a standalone experience for students curious about data science but not ready to commit to a major or minor, or by anyone outside of NYU who is interested in data science.

While much of this text is original material, several sections have been adapted from the excellent textbook Inferential Thinking by Ani Adhikari, John DeNero, and David Wagner, which was developed for the UC Berkeley (UCB) course Data 8: Foundations of Data Science.

The sections we have adapted from Inferential Thinking are based on versions from the last commit (64b20f0) of the book licensed under the Creative Commons CC-BY-NC license. We indicate the sections that are adapted from this commit throughout the text. All of our original work (any section not noted) is licensed under the Creative Commons CC-BY 4.0 license.

Specifically, major differences between this book and Inferential Thinking include:

  • We do not use the datascience package developed by UCB; we replace it with more commonly used packages (for example, most dataset manipulation is now done with pandas).

  • We’ve substantially revised the statistics material.

  • We’ve added more detailed discussions of causal inference, evaluating data, and ethics & data privacy.

  • We’ve added high-level previews of other exciting areas of data science, including reinforcement learning (RL), natural language processing (NLP), and text-as-data (TAD).

Using this text

All code and analyses in this book are run in a Jupyter Notebook environment. If you are new to coding, we recommend that you use the same or a similar environment.

The data for all examples throughout the book can be found here. We encourage you to practice replicating our code and analyses with this data as you work through the book. Note that you will need to change the file paths in our example code so that they point either to the location where you have stored the data or to the raw data URL on GitHub. See Ch. 6.4 for a more detailed walk-through of how to load and inspect datasets in a Jupyter notebook.
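As a rough illustration of what changing a file path looks like, here is a minimal sketch using pandas. The folder, file name, and column names below are hypothetical stand-ins and not the actual course data; the sketch writes a tiny CSV to a temporary folder only so that it can run on its own.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical location: in practice, set this to the folder where you
# saved the course data (a temporary folder is used here as a stand-in).
data_dir = Path(tempfile.mkdtemp())

# Stand-in file so the sketch is self-contained; the file name and
# columns are made up and do not match the real course datasets.
csv_path = data_dir / "example.csv"
csv_path.write_text("name,score\nAda,90\nGrace,85\nAlan,88\n")

# pd.read_csv accepts a local file path; it also accepts a URL string,
# such as the raw URL of a CSV file hosted on GitHub.
df = pd.read_csv(csv_path)

# Quick checks that the data loaded as expected.
print(df.shape)   # number of rows and columns
print(df.head())  # first few rows
```

To adapt this to your own setup, you would replace `csv_path` with the path (or raw URL) of the file you want to load and leave the rest unchanged.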