Data Science Curriculum

Introduction

This curriculum is an overview of the tools, techniques, and knowledge required to become a successful data scientist. It assumes a basic level of scientific and statistical understanding; those newer to the field may want to brush up on those baseline skills, while those with more experience may find the basic sections a simple refresher. Not every skill we recommend will be used on a given project, and advanced projects will sometimes require novel research or techniques and tools not listed here. However, facility with the items in this curriculum should leave you both competitive among current data scientists and equipped to learn new skills as necessary.

Most critically, the role of a data scientist is to reliably understand and manipulate data to produce meaningful insights, and to distinguish meaningful insights from spurious ones. Opportunities to learn and practice this skill are much harder to come by than tutorials on how to perform a given analysis in a certain programming language. That said, critical thinking and problem solving, which are fundamental to the data science process, aren’t exclusive to data science training. Many backgrounds, whether in the humanities or elsewhere, will have laid this foundation. Practice in applying these thought processes to novel data, and in acquiring new knowledge through external resources and lessons, will serve any new or aspiring data scientist well.

A Note on Timelines

Timelines are expressed as three categories that indicate the following ranges of expected learning time. Unless explicitly stated otherwise, these indicate the time to learn a skill well enough for immediate use as a practicing data scientist, not the time to master it. For instance, for data science use, a short amount of time should be devoted to the basics of version control using GitHub, so it is marked Short Term. Mastering the full variety of version and repository control available in a system like GitHub is a much longer-term project.

Short Term: Measured in minutes to hours
Medium Term: Measured in hours to days
Long Term: Measured in days to weeks

Baseline Knowledge

Expected to have been already acquired:

Most data science relies heavily on a sound conceptual understanding of the math involved; the actual computation, however, is performed in code. In general, this means that understanding where mathematical mistakes might occur is usually enough to catch errors in processing.

Simple algebra
Simple order of operations
Experience reading plots and visualizations of data

Foundational Skills

Scientific method (Short term)
Hypothesis testing (Medium term)
- What p-values actually mean
- Selecting the appropriate test for the data or question at hand
- Bootstrapping and non-parametric methods
Research-driven problem solving (Short term)
- Search
- Stack Overflow
- GitHub Issues
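
As a rough illustration of the hypothesis-testing items above, the sketch below estimates a p-value with a permutation test and a confidence interval with the bootstrap, using only the Python standard library. The function names, the sample data, and the resample counts are assumptions made for this example, not part of the curriculum. The p-value here is simply the share of random relabelings that produce a difference in means at least as large as the observed one.

```python
import random
import statistics


def permutation_p_value(a, b, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in means (non-parametric)."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # relabel the observations at random
        diff = abs(statistics.mean(pooled[:len(a)]) - statistics.mean(pooled[len(a):]))
        if diff >= observed:
            extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)


def bootstrap_mean_ci(sample, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    means = sorted(
        statistics.mean(rng.choices(sample, k=len(sample)))  # resample with replacement
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical measurements, invented for illustration only.
control = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7]
treated = [12.9, 13.1, 12.7, 13.4, 12.8, 13.0, 13.3, 12.6]

p = permutation_p_value(control, treated)
ci = bootstrap_mean_ci(treated)
print(f"permutation p-value: {p:.4f}")
print(f"95% bootstrap CI for the treated mean: ({ci[0]:.2f}, {ci[1]:.2f})")
```

A small p-value says the observed difference would be unusual if the group labels were arbitrary; it does not give the probability that the hypothesis is true.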

Programming

(Medium term to learn from scratch, long term to master)

Primary languages
- Python
  - Installation and environment handling
    - Pip installation
    - Virtualenv
    - Anaconda distribution
  - Critical Packages:
    - Numpy
    - Pandas
    - Matplotlib
    - Scikit-learn
- R
  - RStudio
  - Package installation
  - Critical Packages
    - Tidyverse
    - ggplot2
    - Dplyr
    - Tidyr
Version Control (Short term)
- GitHub
Notebook format (Short term)
- Jupyter Notebook and Lab
Big Data concepts and applications (Medium term)
- Hadoop / Spark
  - Purpose and function of map/reduce
  - Scala Programming
- In-memory data handling and streaming
- Apache Beam
Containerization (Medium term)
- Primary
  - Docker
- Secondary
  - Kubernetes, Swarm, etc.
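
To make the "purpose and function of map/reduce" item above more concrete, here is a minimal single-machine sketch in plain Python. It is not Hadoop or Spark code; the function names and the sample text are assumptions for illustration. The idea it shows is that each chunk of data is processed independently (map), and the partial results are then merged pairwise (reduce), which is what allows the work to be spread across many machines.

```python
from collections import Counter
from functools import reduce


def map_phase(chunk):
    """Map: turn one chunk of text into partial word counts."""
    return Counter(chunk.lower().split())


def reduce_phase(left, right):
    """Reduce: merge two partial results into one."""
    return left + right


# Hypothetical input, already split into chunks as a cluster would do.
chunks = ["big data is big", "data is everywhere", "big ideas"]

partials = map(map_phase, chunks)  # each call could run on a separate worker
totals = reduce(reduce_phase, partials, Counter())
print(totals)
```

Because the merge step is associative, the partial counts can be combined in any grouping, which is the property distributed frameworks rely on.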

Data Wrangling and Cleaning

(Short term)

Data dimensionality
- Common characteristics of 1D Data
  - Autocorrelation
  - Seasonality
  - Timestamp handling
  - Frequency
- Common Characteristics of 2D Data
  - Matrix multiplication
  - Curse of dimensionality
- Handling missing data in a principled way
  - Avoiding look-ahead bias
  - Forward and back filling
  - Interpolation
  - Winsorizing
- Common data formats
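
The missing-data techniques listed above can be sketched in a few lines of plain Python. In practice a library such as Pandas would normally be used; the helper names and sample values below are assumptions for illustration. Note that winsorizing limits are usually chosen from percentiles of the data, whereas this sketch takes fixed limits to stay short.

```python
def forward_fill(values):
    """Replace each None with the most recent observed value (uses no future data)."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def interpolate_linear(values):
    """Fill interior gaps on a straight line between neighbouring observations.

    This looks at future values, so it can introduce look-ahead bias
    when the result feeds a forecasting model.
    """
    out = list(values)
    known = [i for i, v in enumerate(out) if v is not None]
    for lo, hi in zip(known, known[1:]):
        step = (out[hi] - out[lo]) / (hi - lo)
        for i in range(lo + 1, hi):
            out[i] = out[lo] + step * (i - lo)
    return out


def winsorize(values, lower, upper):
    """Clip extreme values to [lower, upper] instead of dropping them."""
    return [min(max(v, lower), upper) for v in values]


series = [1.0, None, None, 4.0, None, 6.0]  # hypothetical series with gaps
print(forward_fill(series))        # [1.0, 1.0, 1.0, 4.0, 4.0, 6.0]
print(interpolate_linear(series))  # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(winsorize([-50.0, 2.0, 3.0, 4.0, 120.0], lower=0.0, upper=10.0))
```

Forward filling only uses past values, which is why it is often the safer default for time-ordered data.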

Databases

(Medium term)

SQL language
- Common commands:
  - Creating tables
  - Updating tables
  - Select
  - Delete
  - Relationships
Database types
- Common varieties to expect: MySQL, PostgreSQL
- SQL vs. NoSQL
- Cold Storage
- Data specific
  - Time series
  - Graph
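
The common SQL commands above can be tried without installing a database server by using SQLite through Python's built-in sqlite3 module. The table names, columns, and rows below are hypothetical and exist only to show creating tables, a relationship between them, and the select, update, and delete statements.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this to enforce relationships

# Creating tables, with a relationship from articles to authors.
conn.execute("CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    """
    CREATE TABLE articles (
        id INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        author_id INTEGER NOT NULL REFERENCES authors(id)
    )
    """
)

conn.execute("INSERT INTO authors (id, name) VALUES (1, 'Ada'), (2, 'Grace')")
conn.executemany(
    "INSERT INTO articles (title, author_id) VALUES (?, ?)",
    [("Intro to SQL", 1), ("Joins", 1), ("Indexes", 2)],
)

# Updating and deleting rows.
conn.execute("UPDATE articles SET title = 'Intro to Joins' WHERE title = 'Joins'")
conn.execute("DELETE FROM articles WHERE title = 'Indexes'")

# Selecting across the relationship: number of articles per author.
rows = conn.execute(
    """
    SELECT authors.name, COUNT(articles.id)
    FROM authors
    LEFT JOIN articles ON articles.author_id = authors.id
    GROUP BY authors.id
    ORDER BY authors.name
    """
).fetchall()
print(rows)  # [('Ada', 2), ('Grace', 0)]
conn.close()
```

The same statements carry over to MySQL and PostgreSQL with minor dialect differences, such as how auto-incrementing keys are declared.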

Data Visualization

(Short term)

Interpreting data using visualization methods
- Exploring data by visualizing it
- Fitting kernels
- Covariance matrices
- Heatmaps
- Confusion matrices
- 2- and 3-dimensional plots

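As a small illustration of the confusion matrices mentioned above, the sketch below counts how often each actual label was predicted as each possible label. The labels and predictions are invented for the example, and the helper name is an assumption; plotting libraries typically render the same table as a heatmap.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """Rows are actual labels, columns are predicted labels."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


# Hypothetical classifier output.
actual = ["spam", "spam", "ham", "ham", "ham", "spam"]
predicted = ["spam", "ham", "ham", "ham", "spam", "spam"]
labels = ["ham", "spam"]

matrix = confusion_matrix(actual, predicted, labels)

# Print a plain-text version of the table.
print("actual \\ predicted", *labels)
for label, row in zip(labels, matrix):
    print(label, *row)
```

The diagonal holds the correct predictions; off-diagonal cells show which classes are being mistaken for which.
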
Statistics

(Long term)

Means
Variance
Outliers
Statistical moments
- Testing for normality
Correlation
- Pearson
- Spearman
Statistical distributions
- Discrete vs. continuous
- Primary examples
Linear methods
- Regression
- Multiple and Hierarchical Regression
Nonlinear methods
- Logistic regression
- Bayesian statistics
  - Bayes rule
  - Defining priors
- Markov Chain Monte Carlo (MCMC)
  - R: Stan
  - Python: PyMC3
- Variational Inference
- Hierarchical modeling
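
For the Bayes rule and prior items above, a worked example helps show how much the prior matters. The sketch below applies Bayes rule to a diagnostic-test scenario; the 1% prevalence, 95% sensitivity, and 5% false positive rate are made-up numbers, and the function name is an assumption for this example.

```python
def posterior(prior, likelihood, false_positive_rate):
    """Bayes rule: P(condition | positive test).

    prior               = P(condition)
    likelihood          = P(positive | condition)
    false_positive_rate = P(positive | no condition)
    """
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


# Illustrative numbers only: a rare condition and a fairly accurate test.
p_first = posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
print(f"after one positive test: {p_first:.3f}")

# The posterior becomes the prior when a second, independent test is also positive.
p_second = posterior(prior=p_first, likelihood=0.95, false_positive_rate=0.05)
print(f"after two positive tests: {p_second:.3f}")
```

With these numbers a single positive result leaves the probability at roughly 0.16, because the condition is rare to begin with; a different prior would give a very different answer from the same test.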

Machine Learning

(Medium Term)

Classification vs. regression
Ensembling
- Boosting
  - XGBoost
- Bagging
- Decision Trees
  - Random forests
Neural Networks
- Rudimentary backpropagation
- Activation functions
- Layering
- Convolutions
- Recurrent
Hyperparameter tuning
Transfer learning
Generative adversarial networks
Reinforcement learning
Unsupervised methods
- Clustering
  - Hierarchical
  - K-means
- Auto-encoders
- Principal components analysis
- Independent components analysis
Data quantity requirements
- Resampling and data expansion methods

Cloud Computing

(Short Term to make decisions on what to use, Medium Term to properly utilize an individual product)

A basic understanding of the major cloud providers available to data scientists
- Google Cloud Platform (GCP)
- Amazon Web Services (AWS)
- Microsoft Azure
Common storage services
- Google Cloud Storage (GCS) / Buckets
- Amazon S3
- Azure Blobs
Common simplified scalable functions
- Google Cloud Functions
- AWS Lambda
- Azure Functions
Data Science specific tools
Google
- Dataproc
- Dataprep
- AI Hub
- Jupyter Notebook
- Machine Learning Engine
- BigQuery
- AutoML (Vision, Tables, Language)
- NLP API
AWS
- EMR
- Redshift
- QuickSight
- SageMaker
Azure
- Azure Databricks
- Machine Learning Service
- Machine Learning Studio
- HDInsight
- Azure Notebooks
- Data Science Virtual Machine
