Machine Learning

Predicting Online Shoppers Purchasing Intention with H2O

Sep 9, 2019

Poster | Code on GitHub Repository | About the Data

Lung Cancer Detection from CT Scan Images Using DCNNs

Sep 9, 2019

Poster | Report | Code on GitHub Repository | About the Project: Kaggle Data Science Bowl 2017

Thoughts After Attending Conferences

Aug 8, 2019

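JSM memory: A few days ago at JSM I had the good fortune to attend a talk by Yihui Xie of RStudio and to chat with him briefly. When Yihui heard that I was using blogdown, he said he hoped I would keep updating my homepage and blog. At the time, for various reasons, I had already put off my latest post for two months, so I felt a little guilty. To be honest, I am not good at writing. Back in high school, my Chinese teacher constantly complained that my essays lacked a clear central idea. I was even more of an overthinker then than I am now and had a lot I wanted to express, but my command of language and logic was not up to it, so many tangled, unsorted thoughts got crammed into a single essay and poured out all at once. I had not figured out what I wanted to say, and readers probably came away feeling lost in a fog. Later I studied mathematics in college and spent my days going in circles among definitions and symbols, so I handed whatever writing skills I had back to my teachers. By rights I should have lost all interest in writing (some old friends from secondary school may still remember my old blog). Yet the way technology affects people is a curious thing. Years after giving up writing, I happened to learn about blogdown, which Yihui developed. And so, while working on my coding skills, I found the urge to write something again. But I digress. I said I had put off my latest post for two months; that is because I have been working nonstop during that time. The research projects I had been working on are all wrapping up: one recently became a manuscript and was posted on bioRxiv, and the paper for another is being written. In addition, at a friend's invitation, I volunteered for the NGO 大数 and wrote the first WeChat official account article of my life. Then there were two conferences: SDSS in Bellevue, Washington, at the end of May, and JSM in Denver, Colorado, which has just ended. Overall, I have had a full summer; it deviated from my plans, but I did get some things done. The two conference trips left the deepest impression on me. During my PhD years I have seldom gone to conferences, always feeling that I had not produced results yet and lacked the confidence to give a talk. Only after going to these two did I realize that, although conferences are partly about presenting your academic work, networking matters more. Talking with people and learning what others are working on helps you reflect on and correct your own path. Conferences are especially useful for a PhD student like me, who keeps getting distracted from the main line of work and is always drawn to the "greener grass" (the grass is greener on the other side) of other fields. At SDSS, for instance, I heard a lot about production machine learning in industry, including Kubeflow, TensorFlow's TFX, and H2O. Later, I was lucky enough to use that as an opening to invite Dr. Erin LeDell of H2O.ai to give a talk at 大数. The talks at JSM were even more interesting, ranging from rigorous methodological innovation (continuous-time MCMC and its application to Bayesian inference) to classical methods applied to new problems (using the PageRank algorithm to analyze Reddit network data and find highly rated posters), covering all sorts of topics. Attending these talks does not mean I actually learned new things just by listening. But I clearly sensed the gap between my skills and what society needs, and that gave me a strong sense of crisis. There is a joke about PhDs to the effect that the longer you study, the narrower your knowledge becomes, until you finally "know everything about nothing." There is nothing wrong with that in itself; after all, the purpose of doctoral training is to produce specialists who master a particular field. From the standpoint of personal development, however, focusing too narrowly on your own research area and ignoring your surroundings is not good, because it is easy to lose your bearings. Innovation in academia can happen very quickly. Take BERT, which last year claimed to sweep the natural language processing tasks: in just a few short months this summer it was first surpassed by XLNet, which within a month and a half was in turn overtaken by Baidu's ERNIE 2.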

ML Summary Series (2) - Bias-Variance Tradeoff

Jun 6, 2019

We are all familiar with the workflow of supervised learning: fit models on the training data, then make predictions on the test set. But why should a model's performance on the training set tell us anything about its performance on the test set? Does model performance always generalize to new data? Learning theory aims to address these questions under a general, abstract formulation of the supervised learning problem, without specifying details like the type of model or the source of data.
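As a rough illustration of why training performance need not carry over to the test set, here is a minimal sketch that fits polynomials of increasing degree to noisy samples and reports both errors. The data-generating function (a noisy sine curve), the sample sizes, the degrees, and the helper names `make_data`, `mse`, and `train_test_errors` are all assumptions chosen for illustration; none of them come from the post itself.

```python
import numpy as np


def make_data(n, rng, noise=0.3):
    """Draw n points from an assumed toy model: y = sin(2*pi*x) + Gaussian noise."""
    x = rng.uniform(0.0, 1.0, size=n)
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, noise, size=n)
    return x, y


def mse(coeffs, x, y):
    """Mean squared error of the polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))


def train_test_errors(degrees, n_train=30, n_test=1000, seed=0):
    """Fit one least-squares polynomial per degree on the training set
    and return {degree: (training MSE, test MSE)}."""
    rng = np.random.default_rng(seed)
    x_tr, y_tr = make_data(n_train, rng)
    x_te, y_te = make_data(n_test, rng)
    results = {}
    for d in degrees:
        coeffs = np.polyfit(x_tr, y_tr, deg=d)
        results[d] = (mse(coeffs, x_tr, y_tr), mse(coeffs, x_te, y_te))
    return results


if __name__ == "__main__":
    for d, (train_err, test_err) in train_test_errors([1, 3, 9]).items():
        print(f"degree {d}: train MSE = {train_err:.3f}, test MSE = {test_err:.3f}")
```

Because each higher-degree polynomial family contains the lower-degree ones, the training error cannot go up as the degree grows. The test error carries no such guarantee, which is the gap the questions above are about.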

ML Summary Series (1) - Concepts

May 5, 2019

Machine learning (ML) is by no means new to me. I took ML courses in college and in grad school. In college, I was also in a study group where we went through Bishop’s PRML chapter by chapter. Later, as a PhD student, I used machine learning in plenty of projects, from visualizing data after dimensionality reduction with PCA to building prediction models with logistic regression, random forests, support vector machines, etc.


Yuqing
