- Shelter Animal Outcomes Presentation/Code (fin): 林子祥, 簡子軒, 潘星丞, 林唯德
- House Price (fin): 田兆元, 簡子軒, 曾品華, 高淑婷, 江嘉容
- Bosch Production Line Performance: 林子軒
- ML of IMDb Movies evaluation (ppt): 林楷文, 梁智傑, 馮獻慶, 林修禾
XGBoost, GBM and boosted trees
- A gentle introduction to XGBoost by Brownlee contains many informative links.
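A recent (June 2017) FB post by Yuan-Chun Ivan Chang
During today's discussion, I suddenly realized that there are plenty of people who "only know how to use ready-made tools, without any idea of why they work." Some even use nothing but the default parameters set in a package, without understanding what those parameters mean. How can I confirm that these people really know what they are doing? If they are my students, I can ask them directly, but how do I ask someone else's students? Asking them feels like questioning their advisor, doesn't it?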
A glance at DL
ROC, AUC, pAUC
Criteria
- Training/Testing Error: Convenient but rough
- Confusion matrix/Contingency Table: Better but restricted to discrete classifiers (or probabilistic/score classifiers at a given threshold)
- ROC
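The three criteria above can be compared on a toy example. The sketch below uses only the Python standard library and made-up labels and scores (both are illustrative assumptions, not course data). It builds a confusion matrix at one threshold, traces the ROC curve over all thresholds, and computes the AUC and an unnormalized pAUC (area restricted to false positive rates up to `max_fpr`) by the trapezoidal rule.

```python
def confusion_matrix(labels, scores, threshold):
    """Return (TP, FP, FN, TN) for a score classifier cut at `threshold`."""
    tp = fp = fn = tn = 0
    for y, s in zip(labels, scores):
        pred = 1 if s >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def roc_points(labels, scores):
    """ROC curve as (FPR, TPR) points, one per distinct threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp, fp, _, _ = confusion_matrix(labels, scores, t)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points, max_fpr=1.0):
    """Trapezoidal area under the ROC curve for FPR in [0, max_fpr].

    With max_fpr < 1 this is an unnormalized partial AUC (pAUC).
    """
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Cut the last segment at max_fpr by linear interpolation.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy data (assumed for illustration): 1 = positive, 0 = negative.
labels = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]

points = roc_points(labels, scores)
print("confusion (TP, FP, FN, TN) at 0.5:", confusion_matrix(labels, scores, 0.5))
print("AUC:", auc(points))
print("pAUC (FPR <= 0.5):", auc(points, max_fpr=0.5))
```

The confusion matrix depends on the chosen threshold, whereas the ROC curve summarizes every threshold at once. That is one way to read the "Better but restricted" remark in the list.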
“Off-the-shelf” machines
Refer to HTF2009, Section 10.7: “Off-the-shelf” procedures for Data Mining
Also Section 10.8: Spam Data. Pay attention to performance comparisons of these machines.
Two-sample t-test: Why?
- McNemar test
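One possible reading of the "Why?" is that two machines evaluated on the same test cases give paired outcomes, so an independent two-sample comparison does not fit the design. McNemar's test uses only the discordant cases, where exactly one machine is right. The sketch below is a minimal standard-library version with the usual continuity correction. The correctness vectors are made-up numbers for illustration, and the function name `mcnemar` is hypothetical.

```python
import math


def mcnemar(correct_a, correct_b, continuity=True):
    """McNemar's chi-square test for two classifiers on the same test cases.

    correct_a[i] and correct_b[i] are truthy when machine A (resp. B)
    classified test case i correctly. Returns (b, c, statistic, p_value),
    where b = A right and B wrong, c = A wrong and B right.
    """
    b = sum(1 for a, bb in zip(correct_a, correct_b) if a and not bb)
    c = sum(1 for a, bb in zip(correct_a, correct_b) if not a and bb)
    if b + c == 0:
        return b, c, 0.0, 1.0  # no discordant pairs: no evidence of a difference
    diff = abs(b - c)
    if continuity:
        diff = max(diff - 1, 0)
    stat = diff ** 2 / (b + c)
    # Survival function of chi-square with 1 degree of freedom:
    # P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return b, c, stat, p_value


# Assumed toy outcome: 50 cases both right, 12 only A right,
# 3 only B right, 5 both wrong.
correct_a = [1] * 50 + [1] * 12 + [0] * 3 + [0] * 5
correct_b = [1] * 50 + [0] * 12 + [1] * 3 + [0] * 5

b, c, stat, p_value = mcnemar(correct_a, correct_b)
print(f"b={b}, c={c}, chi-square={stat:.3f}, p={p_value:.4f}")
```

The 55 concordant cases do not enter the statistic at all. Only the 12 versus 3 split among the discordant cases carries information about which machine is better.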
Homework: Check 5/22
- Exercise 10.6
- Construct/Redo Table 10.1 using your data and your favorite machines. Prepare a short presentation (10 min or so) based on your new Table 10.1 and performance evaluation/comparison similar to Section 10.8
BD Prediction: Drive 1
Homework/Discussion 1
HTF 2009: Exercise 2.1, 2.8; 3.3(a), 3.30; 4.2, 4.3, 4.9
Checkpoint 4.27 (Thu)
A Learning Path of (S)ML via R/Python
VS
- Learning R or Python
- Learning Statistics, Machine Learning, Linear Algebra, etc
- Simultaneous learning vs. spiral learning
Suited for: those who know some statistics (e.g., regression) and some language (Japanese; e.g., C++)
The Path
Read (foundation: books, papers)
- Textbook: Hastie, Tibshirani and Friedman (2009). The Elements of Statistical Learning: Data Mining, Inference and Prediction. 2nd Edition. (aka. ESLII) Springer-Verlag.
- An Introduction to Statistical Learning with Applications in R (aka ISLR), Markham’s summary @ R-bloggers; Intro to linear regression (Python).*, Warmenhoven’s repo of Python code for ISLR
View (slides, docs)
- Ensemble Learning (Attention to p7-p10, regression as a learning machine)
Play & Hack
- Playgrounds: Kaggle; CrowdAI, DrivenData, CrowdAnalytix (3 Kaggle alternatives)
- More hacks: 25+ websites to find datasets for data science projects
- Path suggested by Analytics Vidhya
* Diagnosis and remedial measures are needed for sound GLM (regression, ANOVA, ANCOVA) statistical analysis, particularly for modern high dimensional data (small n, large p)
Using Python as if it were R
User Background
- Tasks
- Statistical data analysis: general linear models (regression, ANOVA, ANCOVA), generalized linear models (logistic regression), PCA, Multidimensional Scaling, simulation
- numerical analysis (numerical integration, optimization)
- Statistical machine learning: FDA, Boosting variants, SVM, Random forests
- R core: R (download 0-cloud)
- R procedures and packages: e.g. glm, glmnet, ada, GAMBoost, e1071, randomForest, rpart. Search MRO/package
- IDE: R, RStudio (IDE for R)
Why Python?
- Computational musicology: MIDI Toolbox (MATLAB), few R packages -> music21
- Kaggle: Kernels
- Open source +1: Linux, LibreOffice, LaTeX, etc.
Python <~ R user
- Core: Python 2.7.x or Python 3.5.2
- IDE: Jupyter, Spyder
- Packages: PyPI
- Using Python as if it were R
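As a small illustration of what using Python the way an R user would might look like, the sketch below mimics the core output of R's `lm(y ~ x)` for a single predictor. It uses only the standard library so that it stays self-contained. In practice an R user would more likely reach for a package from PyPI with R-style formulas, such as statsmodels. The function name `lm_simple` and the data are hypothetical.

```python
from statistics import mean


def lm_simple(x, y):
    """Least-squares fit of y = intercept + slope * x (one predictor).

    Returns a dict loosely mirroring what coef() and summary() report
    for an R lm object: intercept, slope and R-squared.
    """
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    fitted = [intercept + slope * xi for xi in x]
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return {
        "intercept": intercept,
        "slope": slope,
        "r_squared": 1 - ss_res / ss_tot,
    }


# Made-up data, roughly y = 2x.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

fit = lm_simple(x, y)
print({name: round(value, 4) for name, value in fit.items()})
```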
Example: DeepBach
- Project/Problem oriented: DeepBach: a Steerable Model for Bach chorales generation
- Github: /DeepBach
- Result: music score (image)