I used an out-of-the-box algorithm, messed around a bit,
and definitely did not make the leaderboard.
That is not how Kaggle competitions work. Nearly everyone on the leaderboard is proficient in data analysis and machine learning. They all tried that out-of-the-box algorithm on their first attempt. But they did not give up so easily.
Understand the business problem
If you want to predict flight arrival times, what are
you really trying to do?
This is not different from Kaggle competitions; this is a tip for performing better in them. See also the GE Flight Quest (https://www.gequest.com/c/flight). Those winners used industry-standard machine learning and optimization techniques, but also creative insights and hunches, like tweaking the target labels:
"A next step is to ask, “What should I actually be predicting?”. This is an important step that is often missed by many – they just throw the raw dependent variable into their favorite algorithm and hope for the best. But sometimes you want to create a derived dependent variable. I’ll use the GE Flight Quest as an example: you don't want to predict the actual time the airplane will land; you want to predict the length of the flight; and maybe the best way to do that is to use the ratio of how long the flight actually was to how long it was originally estimated to be and then multiply that times the original estimate." - Steve Donoho - http://blog.kaggle.com/2014/08/01/learning-from-the-best/
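Donoho's derived-target idea can be sketched in a few lines. This is only an illustration of the quote, not his actual code; the numbers and variable names are made up, and a constant mean stands in for a real trained model:

```python
# Derived dependent variable: instead of predicting the landing time
# directly, predict the ratio of actual to estimated duration, then
# multiply the prediction back onto the original estimate.
estimated = [120.0, 95.0, 210.0]   # original duration estimates (minutes)
actual    = [135.0, 90.0, 225.0]   # observed durations (minutes)

# The derived target: how much longer/shorter the flight ran than estimated.
ratios = [a / e for a, e in zip(actual, estimated)]

# A trained model would predict this ratio per flight; here a constant
# mean stands in for the model.
predicted_ratio = sum(ratios) / len(ratios)

# Convert the predicted ratio back into a duration prediction.
predicted_minutes = [predicted_ratio * e for e in estimated]
```

The point of the transformation is that the ratio is roughly stationary across flights of very different lengths, which makes it an easier target than raw landing times.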
Furthermore, it is clear to everyone that data science in a business setting and in a competitive setting are different. To say they are equal would be like saying paintball is equal to being in the military. But to say that Kaggle is not machine learning is to say that paintball requires no marksmanship.
There are some very messy, unwieldy datasets on Kaggle right now. For example, the Seizure Detection challenge has many GBs of raw sensor data from just a few patients. That requires a competitor to clean the data, understand the problem domain and the evaluation metric, set up cross-validation, and get a model running, all on a laptop in the evening hours.
The author of that blogpost is invited to team up with me or others. Let's see if we can use machine learning on some pressing issues. I'd also love it if Stripe would host a contest on Kaggle.
Completely agree. Everything that is supposedly not addressed by Kaggle actually is, aside from productionizing and monitoring your model in production. Sure, it's a simplified, less open-ended version of what you'll encounter in the real world, but that doesn't mean that many of the core concepts don't translate. It's kind of like saying that you shouldn't do your calculus practice problems because math in the real world is never so straightforward.
I think the most legitimate knock against Kaggle is that in many business settings there probably isn't much value in improving that extra .00001 (but obviously there are exceptions).
Anyway, I think that even someone very experienced in machine learning would learn something doing a Kaggle competition, especially if the competition is in an area outside their core expertise.
"Once I played against an NBA player and I couldn't do anything against him. I felt sad and demoralized. I still don't think I could win against an NBA player, but every day I do (among other things) play basketball with my friends! And, you know, there are so many problems in basketball aside from dribbling and stuff: you have to find space to play with your friends, you have to convince the guy who doesn't want to play this evening. Once I didn't have time to play at all, and if that has ever happened to you, you know: it doesn't matter how good you are at dribbling if you don't have time to play. So I decided to write an article for you guys: that stuff they do in the NBA has nothing to do with basketball!"
To me it sounded more like "the problems they do at Mathematical Olympiads are not the real mathematics done at academia/the enterprise," which turns out to be quite true in many ways.
Not just math -- even the problems done at programming contests have fairly little to do with programming!
I've worked at Google for nearly 10 years -- not sure I can solve the Google Code Jam problems :) Or at least I haven't been motivated to do so. I know all my basic data structures and algorithms, but the questions don't seem that motivating.
Terminology would probably be important – programming contests have a lot to do with programming, you literally sit and write programs that pass the test cases.
I'm sure you meant it has fairly little to do with day-to-day programming for the majority of software engineers, which is correct.
Regarding the article, I also disagree with it. I feel like the author is saying "algorithms != programming contests", which is false. Algorithms are the essence of programming contests, just like machine learning is the essence of Kaggle competitions. Open up Bishop's "Pattern Recognition and Machine Learning" and you won't find anything about "deploying your model to production", but you will find lots of math on how to build models, because that is what ML is. What the author seems to be describing is data science.
I'd rather say that mathematics in academia isn't real mathematics. It depends on the institution, of course, but what is generally taught to students under the name of "mathematics" (especially in technical programs like engineering, CS, etc.) has less mathematical value than a typical Olympiad problem. But that wasn't my point, anyway.
Yes, it is true that "pure something" is almost always significantly different from "something in real life". That's pretty obvious, actually. There's virtually nothing in "real life" that consists of only one "core" activity. In fact, it's the opposite: you make an activity "pure" in order to master it, because in real life there's so much that distracts you from mastering that one essential skill, whatever it may be. Plumbers and painters (and almost everyone else) must be able to talk to others and solve entirely human problems in order to do their jobs. A person you would call a sniper actually does much more running, crawling, waiting, and hiding than shooting. And does it really surprise anyone that police officers do tons of paperwork every day and not so much of the stuff they show in the movies? A somewhat opposite example: it's obvious that a cook's work isn't only slicing vegetables, but have you ever seen how a real cook handles a knife? Please do; it's a sight to behold. There's nothing "pure" in this world, no surprise here.
So if it were Yann LeCun writing something like this, it would be justifiable (although I would be rather surprised to hear it from him), but in this particular case it's more like saying "Kaggle isn't important; data science is me!", in which case I find the NBA example apt.
To summarise: "data science is more than building a model" — yes, "Machine learning isn't Kaggle competitions" — no.
I'm quite a new Kaggler, but I did some similar machine learning competitions on similar platforms, and I have to say that this article is mostly bullshit.
How can a data scientist actually complain that the low-value data munging work is already done for him? Gosh, it should be the other way around: we should complain about how much of it we have to do at work!
This is exactly my experience and that of most people in the field I've worked with. In fact, many people have said the quality/processing of data is much more important than the machine learning model you use.
The exception is those fields that have physical data, like computer vision or speech recognition. In those fields, the actual model matters a lot more.
I think every job has this tendency, where the public focuses on the most exciting and interesting part and ignores the mundane but also extremely important parts.
And on the matter of skill/ability, in spite of not being a Kaggle winner, the author couldn't do their job without a good understanding of machine learning models. To do machine learning in practice, you must know both the software development/systems side and the maths/stats/ML side.
Machine learning is just one of many steps required to solve a practical problem. In my opinion, data cleansing/preprocessing is important, but so are other things like feature engineering, model selection, and understanding machine learning models. And Kaggle competitions are a great way to learn these other things in practice.
Hm, looks like this argument about Kaggle pops up all the time.
It is true that getting and cleaning data is an important part of the data science workflow and a necessary skill to have. But I don't believe anyone ever claimed Kaggle is enough to become a good data scientist. Studying algorithms is very useful too, but who said it's enough to become good at software engineering? The skill consists of multiple parts, which can be practiced separately or in combination in a real project where you do everything from scratch.
It's not as straightforward as Kaggle representing one stage in a pipeline. Kaggle also fixes this stage in an unnatural way.
In real life, there is a feedback loop from modeling back to data processing. In Kaggle, all you can do is improve your model. You can't go back and change how the data was processed, or collect different sorts of data.
Also, some workflows simply don't fit into Kaggle's train/test paradigm. E.g. suppose you have an online algorithm that continuously updates its parameters, or a situation where the model is used to generate the train/test data.
Finally, some situations have stringent computational requirements, e.g. <100ms to classify a single instance.
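The online-algorithm case above can be made concrete. Here is a minimal sketch (not from any particular library) of streaming linear regression via single-example SGD: the model changes after every observation, which Kaggle's fixed train/test split can't represent, and each prediction is a handful of arithmetic operations, so latency budgets like the one mentioned are easy to meet:

```python
def sgd_step(weights, x, y, lr=0.01):
    """One online gradient update for linear regression on a single example."""
    prediction = sum(w * xi for w, xi in zip(weights, x))
    error = prediction - y
    # Move each weight against the gradient of the squared error.
    return [w - lr * error * xi for w, xi in zip(weights, x)]

# Parameters update as the stream arrives; there is no frozen training set.
weights = [0.0, 0.0]
stream = [([1.0, 2.0], 5.0), ([1.0, 3.0], 7.0), ([1.0, 1.0], 3.0)]
for x, y in stream:
    weights = sgd_step(weights, x, y)
```

Evaluating such a model offline requires something like prequential ("test then train") evaluation rather than a static held-out set.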
Sure, everything "in real life" is more complicated. I think Kaggle is to data science as programming contests are to software engineering. Would you say that doing programming challenges is "unnatural"? Is it useless or bad? By the same argument, you never have to solve problems for speed, you rarely see clearly defined problems, and you rarely get to apply complicated or rare algorithms in real-life situations.
The thing is, Kaggle helps you develop certain skills. That doesn't mean it helps you develop all the skills "real life" requires. Your life and your job may require a unique combination of skills, and the only way to fully prepare is to actually do real projects that are specific to your job.
I'm selecting my final year (undergrad) project in the next couple of weeks. One of my professors posted a project that will simply be entering a competition on Kaggle. I don't know much about machine learning yet and I thought that this would be a great way to pick up something I've always been interested in.
It sounds like a great idea. Even though the article is entirely correct in that Kaggle competitions are only a small part of machine learning, they are still a good way to learn some aspects.
Even the data cleaning that the author claims is missing from Kaggle competitions is not really missing; it's just that Kaggle has made the cleaning much simpler.
The main advice I would give is to find a problem where you can get an understanding of the data, i.e. not just treat it like a black box.
It's a shame there is a huge disconnect between what data scientists ACTUALLY do in their day to day vs what's published in the media. Many people see machine learning as synonymous with data science when a data scientist's real job is to leverage data to achieve business objectives.
This has a wide range of implications from producing reports, to visualizations for understanding different kinds of trends.
Model building is, imo, the fun part of the process, but far from all of what actually needs to be done.
That being said, Kaggle is a great platform to learn on.
There is a lot of value in understanding how to train a model, but it's only as valuable as knowing how to take data from disparate sources, clean and normalize it, do some EDA, and then decide, AFTER all that, whether you need a model at all to achieve your goals.
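That workflow can be sketched end to end. This is a toy illustration with invented data and column names, using small in-memory frames as stand-ins for real extracts from disparate sources:

```python
import pandas as pd

# Two "disparate sources" (in-memory stand-ins for real extracts).
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "order_total": [10.0, -5.0, 20.0, None],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "region": ["north", "south", "north"],
})

# Merge, then clean: drop missing totals and obviously bad records.
df = orders.merge(customers, on="customer_id", how="left")
df = df.dropna(subset=["order_total"])
df = df[df["order_total"] >= 0]

# EDA: often a simple aggregate answers the question with no model at all.
by_region = df.groupby("region")["order_total"].mean()
```

Only if the aggregate (or plot) leaves the question unanswered does the model-fitting step even begin.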
I think it is more accurate to say that data science isn't Kaggle. The process of taking a data set and fitting the most robust model to it is certainly machine learning.
There is too much hype. Solving data-related problems requires many different skills, and which ones depends on the problem. In general you need background in quite a few of these areas: linear algebra, probability, statistical modeling, stochastic processes, computer science, programming, signal processing, etc. We used to call this electrical/computer engineering :-)
It sounds like someone justifying their own shortcomings. "Since kaggle proves I am not that good at the 'math' part of machine learning, I will dismiss it by pointing out that it does not include all of the steps necessary to make a machine learning approach work for a business"