Since moving full-time into a data science role, I've learned a few dirty secrets.
First, that the vast majority of data science is just applying simple, established algorithms that somebody else has written. (Feature engineering is the real "secret sauce".) With Kaggle, "it has almost always been ensembles of decision trees that have won competitions. It used to be random forest that was the big winner, but over the last six months a new algorithm called XGBoost has cropped up, and it's winning practically every competition in the structured data category." (Andrew Fogg).
Second, that the most accurate algorithm is not necessarily the best. You don't have to be 100% accurate in a marketing campaign, for instance; it's much better to deliver quickly with "good enough" results. My boss (who has a PhD in machine learning) tells me that the "best" recommender systems will include one product in their recommendations that they know you already have and are unlikely to buy again. Apparently, this makes people feel that the recommendations are appropriate to them.
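That trick amounts to a one-line tweak to the final ranking step. Here is a toy sketch; the function name, the products and the one-reserved-slot policy are all my own invention, not any real recommender's API:

```python
def recommendations_with_anchor(ranked_candidates, already_owned, n=5):
    """Return the top-n recommendations, reserving one slot for a product
    the user already owns -- the 'anchor' that makes the list feel personal."""
    owned = [p for p in ranked_candidates if p in already_owned]
    fresh = [p for p in ranked_candidates if p not in already_owned]
    if owned:
        # best-ranked owned product first, then the genuinely new suggestions
        return [owned[0]] + fresh[: n - 1]
    return fresh[:n]

ranked = ["kettle", "toaster", "blender", "mixer", "grill", "fryer"]
print(recommendations_with_anchor(ranked, {"toaster"}, n=4))
# → ['toaster', 'kettle', 'blender', 'mixer']
```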
Third, that a lot of the exploratory and prototyping work is done not at the Spark command line in Scala or even Python, but in a point-and-click environment.
Adventures in Knime
Knime is a startlingly good, drag-and-drop tool for data science. It's free, open source and based on the IDE every Java programmer knows: Eclipse. It comes with lots of examples and has a great community behind it.
I played around with an example from Knime's remote repository (009002_DocumentClustering) that takes the titles of medical papers from two different areas, applies Snowball stemming to the text and tries to cluster the documents.
It looks like this (with me having drag-and-dropped a few PCA nodes over to the right):
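Knime's nodes do the heavy lifting in that workflow, but the same pipeline can be sketched in a few lines of Python. Everything below is my own rough stand-in, not the workflow's actual code: a crude suffix-stripping function substitutes for a real Snowball stemmer, the titles are invented, and the two k-means centres are seeded deterministically (one title per area) to keep the example reproducible:

```python
import numpy as np

def crude_stem(word):
    # crude stand-in for a real Snowball stemmer: strip a few English suffixes
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# invented paper titles from two areas
titles = [
    "clustering of gene expressions",
    "gene expression clusters in tumours",
    "randomised trials of aspirin",
    "aspirin trial outcomes",
]

# bag-of-words matrix over stemmed tokens
vocab = {}
counts_per_title = []
for title in titles:
    counts = {}
    for word in title.lower().split():
        stem = crude_stem(word)
        vocab.setdefault(stem, len(vocab))
        counts[stem] = counts.get(stem, 0) + 1
    counts_per_title.append(counts)

X = np.zeros((len(titles), len(vocab)))
for i, counts in enumerate(counts_per_title):
    for stem, n in counts.items():
        X[i, vocab[stem]] = n

# two-means clustering; centres seeded with one title from each area
centres = X[[0, 2]].copy()
for _ in range(10):
    labels = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    centres = np.array([X[labels == k].mean(axis=0) for k in range(2)])

print(labels.tolist())  # → [0, 0, 1, 1]: the two areas separate cleanly
```

The stemming matters here: without it, "clustering"/"clusters" and "trials"/"trial" would count as different words and the titles would share far less vocabulary.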
The PCA Compute node can then have its covariance matrix and spectral decomposition saved as ARFF (Attribute-Relation File Format) files. To read the output in Python you need a library called liac-arff, something like:
import arff  # the liac-arff package
import numpy as np

# load() returns a dict with 'relation', 'attributes', 'data' and 'description'
with open('mydataset.arff') as f:
    dataset = arff.load(f)
data = np.array(dataset['data'])
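Since the node's output is just a covariance matrix, you can also redo the spectral decomposition yourself in NumPy. A minimal sketch, using invented 2-D data in place of the ARFF export (the second feature is roughly twice the first, so nearly all the variance should land on one component):

```python
import numpy as np

# invented 2-D data standing in for the ARFF export: feature 2 ~ 2 * feature 1
rng = np.random.default_rng(42)
x = rng.normal(size=200)
data = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=200)])

# the covariance matrix -- the same thing the PCA Compute node saves
cov = np.cov(data, rowvar=False)

# spectral decomposition: eigenvalues give the variance along each component
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]  # largest variance first
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# project the centred data onto the first principal component
projected = (data - data.mean(axis=0)) @ eigenvectors[:, :1]

print(eigenvalues[0] / eigenvalues.sum())  # share of variance on PC1, ~1.0
```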