CS 4641 B: Machine Learning (Summer 2020)
Course Overview
- Instructors: Xin Chen (xchen384@gatech.edu), Miguel Morales (mimoralea@gatech.edu)
- Lecture time: Monday and Wednesday (3:30 pm-5:40 pm)
- Location: entirely online due to the COVID-19 pandemic
- Piazza: https://piazza.com/class/ka03twme81f5cl
- TAs: Wendi Ren (wren44@gatech.edu) and Hua Jiang (huajiang@gatech.edu)
This course introduces fundamental machine learning techniques that are widely used in data analysis.
Our emphasis is on two parts: the math underlying the algorithms, and their applications.
- Basic math for data science and machine learning
- Supervised learning
- Unsupervised learning
- Reinforcement learning
Prerequisites for this course include:
1) basic knowledge of probability, statistics, and linear algebra; 2) basic programming experience in Python
Office hours:
Piazza will be the main place for discussions and questions. Students are encouraged to discuss anything related to this course, such as unclear parts of the lectures, assignments, or corrections to the content. Note that one part of the participation grade is based on discussions on Piazza.
If there is something you do not want to discuss in public, Piazza supports private messages.
- Instructor: TBD
- Wendi: Tue 1:00 pm - 2:00 pm at https://bluejeans.com/7788939771
- Huang Jiang: Wed 12:00 pm - 1:00 pm at https://gatech.bluejeans.com/6887816810
Schedule (Coming soon)
Date | Topic | Assignment | Due | Readings |
May 11, 2020 | Course Overview; Class Video Lecture; Class Notes | | | GT Honor Code |
May 13 | Linear Algebra; Class Video Lecture; Class Notes; Probability; Class Video Lecture; Class Notes | | | Linear Algebra Review by Zico Kolter; SVD tutorial; Probability Theory Review by Andrew Moore |
May 18 | Information theory; Class Video Lecture; Class Notes | | | The Differences Between Data, Information and Knowledge; Entropy vs. Variance |
May 20 | Linear regression; Class Video Lecture; Class Notes | Homework 1; Attendance sheet | June 3; May 22 | Simple linear regression in Matrix format |
May 25 | No class (Memorial Day) | | | |
May 27 | Regularization; Class Video Lecture; Class Notes | | | Regularization integration; Regularization math |
Jun 1 | Logistic regression; Class Video Lecture; Class Notes | | | Logistic regression vs Linear regression |
Jun 3 | Project requirement; Class Video Lecture | Homework 2; Project proposal | Jun 17th; Jun 14th | |
Jun 8 | Decision Tree; Class Video Lecture; Class Notes | | | Intro to Decision Tree |
Jun 10 | K-means clustering; Class Video Lecture; Class Notes | | | Curse of dimensionality; Kmeans application and analysis |
Jun 15 | Hierarchical clustering; Density clustering; HW2-officehour; Class Video Lecture; Class Notes | | | Concept of hierarchical clustering; Density based clustering |
Jun 17 | Gaussian Mixture Model; Class Video Lecture; Class Notes | Homework 3 | July 1st | Tools and examples of GMM; GMM and EM algorithm |
Jun 22 | Principal Component Analysis; Class Video Lecture; Class Notes | | | |
Jun 24 | SVM; Class Video Lecture | | | |
Jun 29 | No lecture | | | |
July 1 | No lecture | | | |
July 6 | Markov Decision Processes and Planning Methods; Class Video Lecture | | | Reinforcement learning improves behaviour from evaluative feedback; Reinforcement Learning: An Introduction (chapters 1, 3, 5, 6, 7, 12, 16) |
July 8 | Bandit Problems and Model-free Reinforcement Learning; Class Video Lecture | Homework 4 | July 22nd | |
July 13 | Value-based Deep Reinforcement Learning methods; Class Video Lecture | | | |
July 15 | Policy-based and Actor-Critic Deep Reinforcement Learning methods; Class Video Lecture | | | |
July 27 | Presentation of projects | Presentation, Project report | July 27th, July 29th | |
Grading
Assignments (50%)
- There will be 4 assignments, and you can drop the lowest grade. Each one is designed to test your understanding of the algorithms taught in the lectures.
- Each assignment includes two parts, programming and written analysis, except for the first one (a pure math assignment).
You are required to submit both the code and the report.
Assignments must be submitted through GT Canvas. Submissions by any other means, such as email, will not be considered.
- Although students are allowed to discuss the assignments, each student must submit their own solution independently.
- All assignments follow the “no-late” policy. Assignments received after the due time will receive zero credit.
- All students are expected to follow the Georgia Tech Academic Honor Code.
Project (40%)
The team link is a shared Excel file.
Please fill in your project info in the table as needed.
Each project should have a team of 4-5 students.
Note that the standard will not be lowered if your team has fewer than 4 members.
Please contact the instructor if your team has fewer than 4 members.
You are encouraged to form a team on your own; otherwise, I will assign you to a team randomly.
In each of the following three parts, team members need to clearly state their contributions.
If your name is not on the report or the slides, you will receive zero credit for the corresponding part.
- A project proposal (10%).
- Presentation (10%). Each team needs to grade other teams.
- Project report (20%).
We will have a lecture specifically on the content and requirements of the project.
Several ML tools might be useful: TensorFlow, PyTorch, scikit-learn, Keras, Google Cloud ML. A list of recommended ML tools:
Popular ML tools.
Class participation (10%)
This has two parts (5% + 5%).
One part is attendance: I will publish an attendance sheet on Piazza, and you need to sign it.
The other part is based on how active you are on Piazza, to encourage asking and answering questions.
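As a rough illustration of how the weights above combine, here is a minimal sketch in Python. It is not the official grading formula. It assumes every score is on a 0-100 scale and that the three kept assignments count equally, neither of which is stated above. The function name `final_grade` and its parameters are hypothetical.

```python
def final_grade(assignments, proposal, presentation, report, attendance, piazza):
    """Combine component scores (assumed to be on a 0-100 scale) into a course grade.

    Weights follow the syllabus: assignments 50%, project 40%
    (proposal 10%, presentation 10%, report 20%), participation 10%
    (attendance 5%, Piazza activity 5%).
    """
    if len(assignments) != 4:
        raise ValueError("expected exactly 4 assignment scores")

    # Drop the lowest of the 4 assignment scores.
    kept = sorted(assignments)[1:]
    # Assumption: the remaining assignments are weighted equally.
    assignment_avg = sum(kept) / len(kept)

    weighted = (
        50 * assignment_avg
        + 10 * proposal
        + 10 * presentation
        + 20 * report
        + 5 * attendance
        + 5 * piazza
    )
    return weighted / 100


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    print(final_grade([80, 90, 70, 60], 90, 80, 85, 100, 50))
```

With the hypothetical scores in the usage example, the lowest assignment (60) is dropped, so the assignment average is taken over 70, 80, and 90.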
Project resources and datasets (thanks to Mahdi and Polo)
Covid-19 pandemic
- Covid-19 case data from the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE)
- Covid-19 in US from Kaggle
- Resources for Covid-19 worldwide from Harvard Dataverse
- Covid-19 public dataset from Google Cloud
Tech companies
- Google Dataset Search
- Google public datasets. Thanks Revant!
- Numerous APIs from Google (e.g., Maps, Freebase, YouTube, etc.)
- Kaggle public datasets
- Yahoo WebScope
- Uber data: Anonymized data from over 2 billion trips
- Yelp
- Microsoft Academic Graph
- Zillow: real estate listing site
- Quandl - a dataset search engine for time-series data
- Amazon AWS Public Data Sets (Thanks Jonathan!)
- Data Science Initiative - Microsoft Research has various datasets and access to tools that can aid in data science research
Entertainment
- Movies data: IMDB
- Million song dataset by Echo Nest.
It contains not only basic information about songs (artist, genre, year, length, etc.), but also some musical features (such as tempo, pitch, key, and brightness).
Thanks Minwei!
- Dataset about soccer games, players, clubs.
No API, but easy to scrape.
For a soccer player: transfer history, performance, nationality, birth date, etc.
For a soccer club: performance, squad, etc.
Thanks Ding!
- Retrosheet: MLB statistics (Game/Play logs)
- Social trends (Thanks Jonathan!)
Academic
- KDD Cup: annual competition in data mining, like Kaggle
- Numerous graph datasets (large and small): SNAP, Konect
- UCI also has a collection of links to various datasets, sorted by task (classification, regression, etc.)
Thanks Vinodh!
- Academic domain: Microsoft Academic Search, DBLP
- Classification datasets
Thanks Amish!
- Academic torrents (terabytes) (Thanks Vaibhav!)
- Civil Engineering Dataset (Thanks Dr. Frost)
The summarized
- Awesome Public Datasets. Thanks Marcel Gwerder!
- List of lists of datasets for recommendations.
Thanks Jon!
- Large datasets publicly available. Thanks Gopi!
- The Free 'Big Data' Sources Everyone Should Know
The specialized
- Georgia Tech's campus data (has APIs): bus info, directory, building, T-square, room reservation, building facilities usage (e.g., electricity, lights, A/C, etc.), Oscar/course info/registration, etc.
- NYC Taxi data for 2013 (suggested by Chris Wong).
2013 Trip Data (11.0GB). 2013 Fare Data (7.7GB).
Visualization of a day's trips. Thanks Jitesh.
- Data.gov: U.S. Government's open data
- IPEDS data: Postsecondary education data from the National Center for Education Statistics
- Bureau of Labor Statistics data
- Various geophysical datasets for the oceans (magnetism, gravity, seismology, etc.).
Thanks Ryan!