CS 4641 B: Machine Learning (Summer 2020)
Course Overview
- Instructor: Xin Chen (xchen384@gatech.edu)
- Lecture time: Monday and Wednesday (3:30 pm-5:40 pm)
- Location: entirely online due to COVID-19 pandemic
- Online lecture link: https://bluejeans.com/687936658
- Piazza: https://piazza.com/class/ka03twme81f5cl
- TAs: Wendi Ren (wren44@gatech.edu) and Hua Jiang (huajiang@gatech.edu)
This course introduces fundamental machine learning techniques that are widely used in data analysis.
Our emphasis is on two parts: the math underlying the algorithms, and their applications.
- Basic math for data science and machine learning
- Supervised learning
- Unsupervised learning
- Reinforcement learning
Prerequisites for this course include:
1) basic knowledge of probability, statistics, and linear algebra; 2) basic programming experience in Python
Office hours:
Piazza will be the main place for discussions and questions. Students are encouraged to discuss anything about this course, such as unclear parts of the lectures, the assignments, or corrections to the content. Note that part of the class participation grade is based on discussions on Piazza.
If there is something you do not want to discuss in public, Piazza supports private messages.
- Instructor: Mon 5:30 pm - 6:30 pm at https://bluejeans.com/687936658
- Wendi: Tue 1:00 pm - 2:00 pm at https://bluejeans.com/7788939771
- Hua Jiang: Wed 12:00 pm - 1:00 pm at https://gatech.bluejeans.com/6887816810
Schedule (Coming soon)
Grading
Assignments (50%)
- There will be 4-5 assignments. Each one is designed to test your understanding of the algorithms taught in our lectures.
- Each assignment includes two parts: programming and written analysis, except for the first one (a pure math assignment).
You are required to submit both the code and the report.
Assignments must be submitted through GT Canvas. Submissions by any other means, such as email, will not be considered.
- Although students are allowed to discuss the assignments, each student must write and submit their own solution independently.
- All assignments follow the “no-late” policy. Assignments received after the due time will receive zero credit.
- All students are expected to follow the Georgia Tech Academic Honor Code.
Project (40%)
The team link is a shared Excel file.
Please fill in your project info in the table as needed.
Each project should have a team of 4-5 students.
Note that the standard will not be lowered if your team has fewer than 4 members.
Please contact the instructor if your team has fewer than 4 members.
You are encouraged to form a team on your own; otherwise, I will assign you to a team randomly.
In each of the following three parts, team members need to clearly state their individual contributions.
If your name is not on the report or the slides, you will receive zero credit for the corresponding part.
- A project proposal (10%).
- Presentation (10%). Each team needs to grade other teams.
- Project report (20%).
We will have a lecture specifically on the content of the project and its requirements.
Several ML tools might be useful: TensorFlow, PyTorch, scikit-learn, Keras, Google Cloud ML. For recommendations on ML tools, see:
Popular ML tools.
Class participation (10%)
This has two parts (5%+5%).
For the first part, I will publish an attendance sheet on Piazza, and you need to sign it.
The second part is based on how active you are on Piazza, to encourage asking and answering questions.
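To make the weighting above concrete, here is a minimal sketch of how the components could combine into a final score. The weights come from the breakdown above; the component names, the 0-100 scale for each component, and the `final_score` helper are illustrative assumptions, not the official grading script.

```python
# Weights taken from the grading breakdown above.
# Component names and the 0-100 scale are assumptions for illustration.
WEIGHTS = {
    "assignments": 0.50,           # Assignments (50%)
    "project_proposal": 0.10,      # Project proposal (10%)
    "project_presentation": 0.10,  # Presentation (10%)
    "project_report": 0.20,        # Project report (20%)
    "attendance": 0.05,            # Class participation, part 1 (5%)
    "piazza_activity": 0.05,       # Class participation, part 2 (5%)
}


def final_score(scores):
    """Return the weighted total (0-100) from per-component scores (0-100)."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical student, for illustration only.
example = {
    "assignments": 80,
    "project_proposal": 100,
    "project_presentation": 90,
    "project_report": 85,
    "attendance": 100,
    "piazza_activity": 60,
}
print(round(final_score(example), 2))  # 84.0
```

In this example the assignments contribute 40 points, the project 36 points (10 + 9 + 17), and class participation 8 points (5 + 3).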
Project resources and datasets (thanks to Mahdi and Polo)
Covid-19 pandemic
- Covid-19 case data from the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE)
- Covid-19 in US from Kaggle
- Worldwide Covid-19 resources from Harvard Dataverse
- Covid-19 public dataset from Google Cloud
Tech companies
- Google Dataset Search
- Google public datasets. Thanks Revant!
- Numerous APIs from Google (e.g., Maps, Freebase, YouTube, etc.)
- Kaggle public datasets
- Yahoo WebScope
- Uber data: Anonymized data from over 2 billion trips
- Yelp
- Microsoft Academic Graph
- Zillow: real estate listing site
- Quandl - a dataset search engine for time-series data
- Amazon AWS Public Data Sets (Thanks Jonathan!)
- Data Science Initiative - Microsoft Research has various datasets and access to tools that can aid in data science research
Entertainment
- Movies data: IMDB
- Million song dataset by Echo Nest.
It contains not only basic information about songs (artist, genre, year, length, etc.), but also some musical features (like tempo, pitch, key, brightness).
Thanks Minwei!
- Dataset about soccer games, players, clubs.
No API, but easy to scrape.
For a soccer player: transfer history, performance, nationality, birth date, etc.
For a soccer club: performance, squad, etc.
Thanks Ding!
- Retrosheet: MLB statistics (Game/Play logs)
- Social trends (Thanks Jonathan!)
Academic
- KDD Cup: annual competition in data mining, like Kaggle
- Numerous graph datasets (large and small): SNAP, Konect
- UCI also has a collection of links to various datasets sorted for various tasks (Classification, Regression, etc)
Thanks Vinodh!
- Academic domain: Microsoft Academic Search, DBLP
- Classification datasets
Thanks Amish!
- Academic torrents (terabytes) (Thanks Vaibhav!)
- Civil Engineering Dataset (Thanks Dr. Frost)
The summarized
- Awesome Public Datasets. Thanks Marcel Gwerder!
- List of lists of datasets for recommendations.
Thanks Jon!
- Large datasets publicly available. Thanks Gopi!
- The Free 'Big Data' Sources Everyone Should Know
The specialized
- Georgia Tech's campus data (has APIs): bus info, directory, building, T-square, room reservation, building facilities usage (e.g., electricity, lights, A/C, etc.), Oscar/course info/registration, etc.
- NYC Taxi data for 2013 (suggested by Chris Wong).
2013 Trip Data (11.0GB). 2013 Fare Data (7.7GB).
Visualization of a day's trips. Thanks Jitesh.
- Data.gov: U.S. Government's open data
- IPEDS data: Postsecondary education data from the National Center for Education Statistics
- Bureau of Labor Statistics data
- Various geophysical datasets for the oceans (magnetism, gravity, seismology, etc).
Thanks Ryan!