I recently entered my first Kaggle competition with a friend while taking a graduate Data Mining class (I'm an undergrad). At first I thought the course was going to be super difficult, and while the content is certainly overwhelming, I felt that most people were at roughly the same level of understanding as I was. The class was super interesting, and although I couldn't absorb all the details of every chapter, it certainly opened my eyes to how amazing and vast data science can be.
For our final project, we formed groups and decided between a boring UCI repository project and a fun Kaggle competition with potentially real data (ironically, the dataset we chose turned out to be synthetically generated). Rene and I settled on the TFI Restaurant Revenue Prediction challenge: predict annual revenue for a cross-sectional sample of Turkish restaurants. You can read the full project report here, with the accompanying R code hosted on my GitHub here (warning: the R code is really messy, but it covers EVERYTHING).
In summary, we started off working models-first and features-second. Looking back, this was a huge mistake and it should have been the other way around. The models we tried (in order) were: linear regression, Random Forest, SVM, and an ensemble (various things like GBM, lasso, etc.). Ensembling was done last, when we were out of options. By the time I got to implementing Random Forest, Rene was at approximately 455th place with a simple RF. The data was pretty weird in terms of train/test set features; our report details all the issues as "problems" in section 3. After a few clever hacks with kNN and K-Means, I fed the engineered features into both SVM and RF and got to 58th place! However, the top 100 was short-lived, as other competitors caught up within a few days... I had to focus on other exams, so I handed everything off to Rene. After a few more days of tinkering, we couldn't seem to get better results and called it there.
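Our actual pipeline was in R (linked above), but the general idea of the K-Means hack plus the RF/SVM blend can be sketched like this. This is a minimal Python/scikit-learn illustration with made-up synthetic data standing in for the competition's obfuscated features; the cluster counts, model hyperparameters, and the simple prediction average are all assumptions, not our tuned settings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Stand-in data: the real competition had obfuscated numeric features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] + rng.normal(size=200)  # stand-in revenue target

# Append the K-Means cluster id as an extra feature (one of the "hacks").
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
X_aug = np.column_stack([X, km.labels_])

# Fit both models; SVM gets feature scaling via a pipeline.
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_aug, y)
svm = make_pipeline(StandardScaler(), SVR(C=10.0)).fit(X_aug, y)

# Blend the two models by simple averaging (a basic ensemble).
pred = (rf.predict(X_aug) + svm.predict(X_aug)) / 2
```

In practice you would tune the number of clusters and blend weights by cross-validation rather than averaging blindly.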