Interestingly, just yesterday, I found out that LinkedIn Friend Suggest uses, among other things, co-logins from the same IP address as a signal. On a test account that I created at work, it eerily showed me all my co-workers in the Friend Suggest list. Later, as soon as I logged in from home, it added my wife to the list.
I wonder if one of the goals of a good Data Scientist is also to be not too accurate, lest the product create an eerie feeling among users! (remember the Target pregnant girl incident?!)
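To make the co-login idea concrete, here is a rough sketch of how a shared-IP signal could work. How LinkedIn actually does it isn't public as far as I know, so the function name, the data layout, and the example IPs below are all made up for illustration.

```python
from collections import defaultdict
from itertools import combinations


def suggest_by_shared_ip(logins):
    """Map each user to the set of other users seen on the same IP.

    `logins` is an iterable of (user, ip) pairs. This is a hypothetical
    illustration, not any real site's algorithm.
    """
    users_by_ip = defaultdict(set)
    for user, ip in logins:
        users_by_ip[ip].add(user)

    suggestions = defaultdict(set)
    for users in users_by_ip.values():
        # Every pair of accounts seen on one IP suggests each other.
        for a, b in combinations(sorted(users), 2):
            suggestions[a].add(b)
            suggestions[b].add(a)
    return dict(suggestions)


# Made-up login events mirroring the anecdote (documentation-range IPs).
logins = [
    ("test_account", "203.0.113.7"),    # office
    ("coworker_a", "203.0.113.7"),
    ("coworker_b", "203.0.113.7"),
    ("test_account", "198.51.100.23"),  # home
    ("wife", "198.51.100.23"),
]
suggestions = suggest_by_shared_ip(logins)
```

A real system would presumably weight this against other signals and discount shared IPs like coffee shops or carrier NAT, which is exactly where the "too accurate" question comes in.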
I think that there is a tension between "creepiness" and "effective marketing". This, I feel, is one of Facebook's core problems: for them to maximize the value of their dataset, their ad targeting has to become incredibly creepy.
One issue is that, from an end user's perspective, it makes it obvious how much information is being captured about them. While most people are aware that their information is being captured, seeing it plastered all over their Facebook feed makes them confront it.
Worse than that are the questions that come with these ads - "Why am I seeing ads for baldness cures?" Is it because I'm a 30+ male, or is it because they have analysed photos I'm tagged in and detected my thinning hair? Sometimes it just feels mean!
This is primarily a challenge of data science working in a marketing environment and doesn't really permeate through all areas of data science. However, it is the form of data science that is most visible, so much of data science and the big data we work with gets lumped in with sleazy marketing.
I was pretty weirded out when the fake account I made specifically to use Spotify (and which has absolutely no information about me in the profile) got a friend request from someone from my grad program.
I'm pretty sure Twitter does this too. I've signed up for accounts in other browsers and then had my other accounts suggested to me as people I should follow.
This is so very, very true. The major change appears to be one of scale, rather than any qualitative change. Funnily enough, since I put predictive analytics (what does that term even mean, anyway?) on my CV I've gotten much more attention from recruiters and employers. I guess it sounds so much sexier than statistics.
More seriously though, the requirements to be able to hack up a prototype and talk to people are probably what hold back a lot of people who otherwise have the skills to be good "data scientists", or just scientists.
My current employers told me at interview that they had no data, and in the three months I've been there I've been slowly discovering that they have loads of it, unfortunately in multiple incompatible forms and jealously guarded by different departments. It is rather funny, though a little sad that they were essentially drowning in data and didn't realise it.
I agree that being able to hack up a prototype could really make someone stand out as a data scientist. The Insight Data Fellows program mentioned in the article has a 6-week program where the focus is on learning enough software development to hack together a prototype by the end of the program. That could be a good way to go.
gaius, I've been appreciating your comments for a long, long time now. But you are wrong here.
There is a difference. It isn't a difference in fundamentals so much as it is a difference in focus.
Business Analysts give reports to CEOs about customer segments or the projected number of signups. They aren't even close to DS or statisticians.
Statisticians tell you about how a drug reacted with a control group or how likely it is that a population feels a certain way given the results of a survey or trial.
Data scientists harness data. They impact every user on a site. "Watch this video" "Follow this user" (recommendations) or "Silently ignore this user's impact on the algorithms that manage where this piece of content should go" (graph analysis) or "What exactly is in this photo" (object recognition) or "What combination of widgets leads to the maximal amount of engagement" (optimization) or "I have this paper that I really like, show me more that are just like it" (recommendations, document classifications, NLP).
It is different. The focus is on users and what they will do or should do or should see. To call them statisticians leads to much less understanding of the value that DS bring. Put me in a room with an actuary from an insurance company. Neither of us could possibly do each other's jobs. Neither of us has the other's skill set.
Now, both of us could learn and get up to speed on how the other works, but a sys admin and a web developer could swap roles more easily than an actuary and a DS. Yet nobody is complaining that we call devs and sys admins different titles.
That's not a counterargument to his point. You're parsing job titles down to the atom, and concluding that "data scientist" is different than "scientist" is different than "statistician", is different than "analyst". Gaius is saying that this job responsibility has been around for a long time, but that people are reaching to find reasons to give it a new name -- exactly what you're doing.
If you ask me, the phrase "data scientist" is recruiter-speak. I have all of the skills required of a "data scientist". I've done the job of a "data scientist". And other than object recognition, I've developed all of the different product features you mention in your comment. You know how I got the skills necessary to do those things? I was trained as a scientist, and there's no such thing as a scientist without data. A person properly trained to analyze data should be able to effectively and fluidly transfer those skills between domains -- otherwise, they're not actually good at it. There's nothing special about internet products that precludes competent people from doing effective data mining on their logs.
I suspect that the real problem here is that "data science" is Internet Hipster for: "someone who has already worked at an internet company, and knows some statistics". Because when it comes right down to it, your average statistician, chemist or physicist is more skilled at data analysis than 99.9% of the "data scientist" types you meet, but they don't easily press the comfort button for hiring managers at consumer internet companies. Why hire the "risky" ex-scientist, when you can hire the guy who claims to be a designer, a software engineer and a statistician?
I agree that data scientist does smell a lot like recruiter/marketing speak. But on the other hand, just because it's a new title doesn't mean it isn't valid. Reducing everyone down to "Scientist" is no more helpful than saying a physicist isn't really a separate job, but just a specialised branch of mathematics. Or for that matter, CS is just a narrow branch of mathematics.
Eventually you have to distinguish new fields from the old, even if they have a lot of commonalities.
Well, yeah...when there's specialized knowledge required for the job (like, say, "physics"), it's obviously a good idea to change the job title.
The problem here is that "data scientist" adds no semantic value above and beyond "scientist". A scientist of data, you say? However will we find such exotic creatures!?
You have a comically narrow definition of statistics, cf. "Statistics is the study of the collection, organization, analysis, interpretation, and presentation of data." (http://en.wikipedia.org/wiki/Statistics)
Right, but you are taking a very web-centric view there. What would you call the guy whose work impacted every shopper in a supermarket? There were people doing what is now called "data science" with supermarket loyalty cards, credit cards, frequent flyer programmes, etc. looooong before there were "data scientists".
In general, that is objectively bad, although for this particular site it is not as bad as it could be.
Here are the three general problems with submitting print views:
1. For most sites, the print view results in a small font and lines that extend all the way across the page. This makes them hard to read. Sometimes, on a desktop, with a bit of fiddling they can actually be made legible to those of us who are older than 40. On mobile, they are often simply not possible for many of us to read.
This particular site is OK in this regard, as they appear to have actually set the line width and the font size so that it comes out reasonable on the screen. In fact, their print view is quite pleasant to read.
2. The print view often omits comments, sidebar links to related stories, links for sharing, and so on. Some people actually might want to use those.
3. There is often no evident link from the print view back to the normal view. Sometimes you can figure it out by playing with the URL, but sometimes the relationship between the print URL and the normal URL is hard to figure out if all you have is the print URL to work with. Note that the normal page, on the other hand, does generally have a link to the print page, so those who prefer the print page can easily go to it.
For these reasons, in almost all cases the submission should be to the normal page, not the print page. Ideally, the submitter can add a comment that gives the print URL to save time for those who do prefer it.
Note that some sites have an "all on one page" option that puts the whole thing on one page but keeps comments, social links, and such. That's the best to use if available.
There's a recommendation I give to people who are writing their online personals ad: If you're sexy, there's no reason to say that you're sexy.
I think the same applies here.
A data scientist is a fancy way of saying a "statistician who can code (should be required in stats programs now anyhow) and who can communicate effectively"
This was ISyE, not stats, and it was 5 years ago, but I was amazed by how much extra work some people would do to avoid having to learn anything but Excel (meanwhile, I was messing around with R and whipping programs up to get better results in less time). This was at a highly ranked engineering program, too.
Based on a few people I've kept in touch with, it seems like it hasn't changed all that much at the undergrad level. The grad level was where the problem sizes and difficulty really forced you to use better tools.
At RIT at least, R and NumPy aren't in the core curriculum. Instead, there is a "Statistical Computing" class which covers SAS. Most students either use Minitab or Excel. Surprisingly (or not), a lot of in-class work is done using a graphing calculator. Of course, that also carries into a lot of the homework.
I wonder, do statisticians actually use graphing calculators to do stats?
I think of my job as essentially that of a statistician (data miner/data scientist/trader/research analyst/whatever you want to call it). I have never used a graphing calculator in my life. If I need to do a quick calculation I open a terminal and boot up R, Python or GHCi, depending on how I'm feeling and how complicated what I need to achieve is.
I remember reading at least 10 articles with nearly the same content during the year. Why are authors so eager to convince everyone of big data's sexiness? Results should speak for themselves. So far LinkedIn's Friend Suggest is one of the biggest success stories.
As a "data scientist" I found this article had much more meaningful content than most articles I've read in the past year. It's not just repeating how data science will be big in the next decade, it discusses who data scientists are and how to hire them.
>So far LinkedIn's Friend Suggest is one of the biggest success stories.
I don't agree with this. Google is basically a big data sciences company. 'Data science' may be a new term, but it describes something companies have been doing for decades.
Yeah, big data crunching pervades basically everything Google does. I joined Google as a UI SWE (basically a webdev), and find that most of my daily work nowadays involves processing large amounts of data to come up with new features. I suppose I made a conscious effort to move back in the stack to more algorithmic back-end work, but even if you stick with UI work, the launch process is so data-driven that you almost need to have a basic familiarity with statistics & data processing.
Looks like a tech company list looking to hire data scientists - essentially a sneaky job advertisement wrapped up in a fluffy HBR (aren't they all?) article written by a consultant who probably wants to get in on the new new thing.
You probably just need better buzzwords (and ideally the background to back it up) -- NoSQL, big data, MongoDB, Hadoop, etc.
I consulted for a client that used those technologies, updated my LinkedIn profile afterwards, and the amount of incoming requests from recruiters and principals has been nothing short of phenomenal. (Anecdotally, 20 InMails in 10 days, of which 14 of them converted into a phone interview with the principal.)
To be pretty honest, prior to my life as a data scientist (and grad school) I was a business analyst. We mined data and threw 10M-100M entries into a MySQL database w/ Rails dashboard and for our non-RT analysis purposes it was tolerable.
There are plenty of data problems out there already warehoused by small-cap and mid-cap firms; I honestly don't see a need to go Web-Scale and all that jazz for its own sake if your use case doesn't need it. There are also shortcuts like sampling to kick the can down the road, but that's another discussion in and of itself.
I think the keyword "big data" ends up being used in even a lot of smaller cases, because everyone thinks what they have is "big data", I'm guessing because they do all genuinely have much more data than they might have had a decade ago. But that still varies widely in size; what some companies think is "big data" is still perfectly analyzable, for non-realtime purposes, on one beefy workstation. Yet, because they'd never seen data with tens of millions of rows before, and it breaks whatever system they were previously using to analyze stuff (SPSS, etc.), what they want to hire is a "big data" person.
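As a quick illustration of the sampling shortcut: below is a minimal reservoir-sampling sketch (Algorithm R), which keeps a fixed-size uniform sample from a stream too large to hold in memory. The function name, the seed parameter, and the toy data are my own choices for the example, not anything from a particular tool.

```python
import random


def reservoir_sample(stream, k, seed=0):
    """Return k items drawn uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


# Toy stand-in for a large table: estimate the mean from 1,000 sampled rows.
rows = range(1_000_000)
sample = reservoir_sample(rows, 1000)
estimate = sum(sample) / len(sample)
```

The estimate lands near the true mean without ever materializing all the rows, which is often good enough for non-realtime analysis.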
COMPLETELY agree. In companies that don't have data as a core competency, "big data" ends up being this business buzzword thrown about because their data is too big for their current set of tools... whether it's R or even Excel or what not.
As a math/stats guy who picked up more programming along the way, I personally think it's MUCH easier to teach a DB guy some business sense than it is for a business analyst to have Hadoop drilled into them. Of course, the downsides of a coder without sufficient savvy are harder to detect than a numbers guy who can't make his program work, and therein lies your problem.
I agree. I wonder if it is possible to get hired as a data scientist as easily if you haven't worked on big data before.
Or, I guess programmers and engineers could start using the big data tools even though they are not needed. Has anyone run Hadoop on a single (multi-core) machine for this purpose?
I doubt it, simply because it's so easy to find big datasets to work on. It doesn't have to be professional, that's the nice thing about a data driven profession.
Check out http://www.kdnuggets.com/ for links to large data sets to work on; there are also some on Amazon.
Also, yes, you can certainly run Hadoop on a single instance, but once you get into "real big" sizes you'll need a cluster to demonstrate expertise, be it on your local machines at your house or on a set of VPS or EC2 or whatever.
ASK HN? : I'm a little confused, can someone please shed some light on this so we can all get a clearer picture? What is the difference between Data Scientist vs Big Data Expert vs Analytics Engineer (Statistics, metrics etc) vs Hadoop Architect vs Machine Learning Expert? Thanks a lot HN people!
The Insight Data Science fellows program looks awesome, but it is disappointing that only PhD candidates and post-docs can apply. There is some irony in the fact that the cover of their brochure uses the Facebook friendship visualization done by Paul Butler, who was an undergraduate intern at Facebook when he made it.
I've found that a lot of companies are looking for data scientists but many of them have very different ideas of what that means. This makes for some interesting interviews.
I recently moved to SF and am currently interviewing for data science positions - particularly ones involving social networks and applied graph theory - so drop me a line if you know anyone who is dealing with that problem space.
Just checked out your LI profile (fellow data science guy here) -- I think you basically need a bit more work experience or some github code to show yourself off. The big data guys like Google who have best practices, brand, and provide great onboarding should be your focus IMHO.
Quick question: What do you classify as work ex? I do mostly iOS programming, but I've been playing with Hadoop + the commoncrawl.org crawl data. Basically, I guess, what level of stats do you need to be comfortable with to call yourself a data scientist?
Following Gladwell's 10000 hour rule, I would say you could probably call yourself a data scientist after 1000+ hours of experience working with datasets successfully. As far as the math goes, you should be able to do regression analysis; you don't need to know tons of stats, but you do need to know stats and probability essentials (first few classes at a good school) deeply. I like this Wikipedia entry on "mathematical maturity": http://en.wikipedia.org/wiki/Mathematical_maturity; apart from writing proofs, it is very relevant.
At the end of the day, analytics is measured by effectiveness and appropriateness, not complexity. Simple regressions will do fine, but the "art" is to choose the right questions to ask. Typically if you're in a business setting that boils down to efficiency problems and maximizing time/money/happiness/etc. Dealing with these real-world problems = work exp.
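For anyone wondering what "simple regression" means in practice, here is a minimal sketch of ordinary least squares with one predictor, in plain Python. The function name and the toy numbers are invented for illustration; in real work you'd likely reach for R or a stats library.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data that lies exactly on y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
slope, intercept = fit_line(xs, ys)
```

The arithmetic is the easy part; as the comment says, picking which x and y are worth regressing is where the judgment comes in.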
Thanks, I'm trying to find relevant ways to get that experience and working on some more sample code as well. What would you consider an "entry level" data science job?
Question for HN: I'm graduating this Spring with majors in Systems Engineering and Physics, and I want to work as a data scientist, preferably at a startup. What can I do to position myself for such a job? If any of you work in the field and are willing to provide some 1-on-1 advice, please shoot me an email - mail@philipithomas.com
From my experience, you will spend most of your time finding, collecting and cleaning data. Doing anything super algorithmically interesting will be rare. Get familiar with the Unix shell + Python (or similar language). The shell will save you tons of code. awk/sed/cut and friends are very fast for cleaning data (in development time and runtime). And shell scripts are good for grabbing things from different systems.
What's the Toronto scene like? I'm finishing up a PhD at NYU this year and was planning on breaking into the field after graduating. NYC is obviously a great place to be, but for relationship reasons I was thinking of moving to Toronto (which is honestly a nicer city anyway, much as I love NY), but a cursory inspection suggests a lot less demand for data-loving jobs. I would love to be mistaken though; am I?
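To give a flavor of the cleaning work on the Python side, here is a small sketch that does roughly what a cut/awk/sort -u pipeline would: trim whitespace, normalize a field, drop rows with bad values, and remove duplicates. The column names and the sample data are hypothetical.

```python
import csv
import io

# Made-up tab-separated data with the usual problems: stray spaces,
# inconsistent case, a missing value, a non-numeric value, a duplicate row.
RAW = """user_id\tcountry\tamount
 1\tUS\t10.50
2\t us \t
3\tCA\tabc
1\tUS\t10.50
4\tca\t7
"""


def clean_rows(text):
    """Return de-duplicated rows with trimmed fields and numeric amounts."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    seen = set()
    cleaned = []
    for row in reader:
        user_id = row["user_id"].strip()
        country = row["country"].strip().upper()
        amount_text = (row["amount"] or "").strip()
        try:
            amount = float(amount_text)
        except ValueError:
            continue  # skip rows with a missing or non-numeric amount
        key = (user_id, country, amount)
        if key in seen:
            continue  # skip exact duplicates after normalization
        seen.add(key)
        cleaned.append({"user_id": user_id, "country": country, "amount": amount})
    return cleaned


cleaned = clean_rows(RAW)
```

For one-off jobs the shell version is usually shorter; something like this earns its keep once the rules get too fiddly for a one-liner.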