Author here. That's definitely my bad and not an intended user experience. The text was initially meant as a transcript accompanying presentation slides. I've compressed the images so it should be at least slightly better now.
The good news is that this already exists in the Redis search module [1], which allows you to do similarity search against indexed embeddings, among other features, and offers performance comparable to other ANN libraries [2], depending on your performance criteria.
I've been using it for a side project to do semantic search on books[3] and have been really happy with its performance. (Not affiliated with any of this, was mostly interested in exploring existing well-performing, fairly standard tools with low latency)
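For anyone curious what this looks like in practice, here's a minimal sketch of indexing and querying embeddings with the Redis search module via redis-py. Treat it as a sketch under assumptions: it assumes a Redis server with RediSearch vector support (2.4+), and the index name "books", the "book:" key prefix, and the 384-dimensional embeddings are illustrative placeholders rather than details from my project.

    import numpy as np
    import redis
    from redis.commands.search.field import TextField, VectorField
    from redis.commands.search.indexDefinition import IndexDefinition, IndexType
    from redis.commands.search.query import Query

    r = redis.Redis()

    # Index over hashes with prefix "book:"; one text field plus an HNSW vector field.
    r.ft("books").create_index(
        [
            TextField("title"),
            VectorField(
                "embedding",  # 384 dims is a placeholder for whatever embedding model you use
                "HNSW",
                {"TYPE": "FLOAT32", "DIM": 384, "DISTANCE_METRIC": "COSINE"},
            ),
        ],
        definition=IndexDefinition(prefix=["book:"], index_type=IndexType.HASH),
    )

    # Store a document; embeddings go in as raw float32 bytes.
    vec = np.random.rand(384).astype(np.float32)
    r.hset("book:1", mapping={"title": "Example title", "embedding": vec.tobytes()})

    # KNN query: the 5 nearest neighbours to a query embedding, sorted by distance.
    query_vec = np.random.rand(384).astype(np.float32)
    q = (
        Query("*=>[KNN 5 @embedding $vec AS score]")
        .sort_by("score")
        .return_fields("title", "score")
        .dialect(2)
    )
    results = r.ft("books").search(q, query_params={"vec": query_vec.tobytes()})
    for doc in results.docs:
        print(doc.title, doc.score)

If I remember correctly, dialect 2 is required for the parameterized KNN syntax, and the distance comes back under the "score" alias so you can sort and filter on it.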
Bit of a side note on Redis Search. While performance is pretty good, it is not production quality yet; the gaps are mostly around monitoring indexing errors and quirky query results. Silent indexing errors pretty much make it impossible to use in production, especially for structured data.
Author here. The primary difference is that Confluence is more like a reference manual. A P2 is more like a conversation and a record of decisions made, along with all the context for those decisions, since each post has a threaded comment section where people can reply to each other. It's not better or worse, but has a slightly different use-case in that it's essentially like email threads or RFCs that anyone in the company can read.
Editor here. Thank you for sharing your memories, and for the note.
Some details about the origin: the vignettes were originally English-language comments in a private Facebook group.
It was a huge team effort to get them all collated, organized, and to check and re-check attribution and consent to share. We are so happy that we are able to share these collective memories and eyewitness testimonies, particularly outside of Facebook's walled garden.
Editor here. These were originally written in English, so we have plans to translate them to Russian, which might take a while since we only have one fully qualified translator. Help is welcome! Shoot me an email (address in my profile).
This is a rather interesting stance to take for a publication whose article quality has degraded and whose ad load has increased over the past five years (my adblocker detected thirty-eight trackers, including two from Facebook, when I tried to access the article) to the point where I refuse to read it.
If you take a look at the homepage through the Wayback archive as it used to appear in 2005[1], 2011[2], and today, you'll see how content disappears and click-baity headlines rise over time.
The Atlantic is very much part of the problem of "the race towards the bottom" the author describes, and instead of having a discussion about how to fix it and maybe trying different revenue models, it continues to unironically have share and tweet buttons at the top of this article.
Publishing pieces like this is one of the few ways the editorial side can put public pressure on the ad sales / revenue side to change or improve their behavior, particularly if other attempts have been ignored.
Writers may not be able to bite the hand that feeds them, but they can nibble.
Therefore, kudos to The Atlantic for publishing the article despite being immersed in that same Prisoner's Dilemma, or more precisely, a Nash equilibrium / first-mover's dilemma.
Came across this yesterday. Can someone (maybe the poster?) talk about when you would use this versus something like scikit-learn or any number of R libraries? Is the goal simply to have all machine learning in Java so it can be productionized more easily?
The project homepage says "Data scientists and developers can speak the same language now!". So it should be easier to productionize an ML project without rewriting the algorithms after the data scientists work out the model in R or Matlab.
I don't know that that's necessarily true. The most recent StackOverflow survey[1] shows a difference of 8%, which is not an overwhelming majority. Granted, that's not an unbiased sample, but I think the OP above is correct: more data scientists use Python than Java.
So anyone wanting to use this library would have to think about tradeoffs: Are the efficiencies lost in data scientists learning to use Java for modeling worth the efficiencies gained in putting a model in production? For some, the answer may be yes, for some no.
Perhaps it's just my selective biases, but I feel like I am reading about more notable people in the tech industry speaking out in favor of privacy over the past couple of months than even when the Snowden revelations happened in 2013.
Could this possibly be a tipping point for adtech as a revenue model? (Although I've been reading about it for almost a year now [1], [2], [3]) I'd like to hope so, and am also curious: how invasive can companies be before consumers start to push back?
Data scientist here. It is 100% possible to do things with kids, but you really have to be motivated to do it, AND it really helps if you have a support system of other people to help take care of your child. I wrote about the dangerous deception we have in American culture, and particularly tech culture, of people who "have it all," but in reality have a bunch of help in the background here[1] and here[2].
If you work full-time and you want to go above and beyond, you're essentially working three shifts: your job, the before-school and after-school routine, and then a third shift of learning or development.
Whatever that means for you in terms of reshuffling energy and other commitments will vary with your personality, energy level, etc.
When you have a small child, it is extremely hard to multitask. So I wait until she is asleep. All after work time and weekends are for her.
Here is the way my schedule works: I pick her up from daycare, do dinner and playing, and then she goes to bed. I then take a half-hour break and delve into whatever I have going on for about three hours.
I'm currently taking a Java class, writing technical blogs, and working out some Python. So I'll usually do an hour of reading/Java homework, then start a blog post, then finish off with whatever else I was working on.
Over the past three weeks, I developed this talk on big data[3]. That was probably the hardest because I needed a lot of time to write the code, test the code and concentrate, and all of my energy was just sapped.
All of this is to say that you can do it. For me personally it takes a lot of reshuffling and work and giving up things, but that's how kids work.
This is a fantastic article for intermediate beginners. On HN, everyone is a senior data scientist working with Spark and Keras and Tensorflow and deep learning.
In the real world, there is a huge chasm between people just learning Excel and developers; not many people even understand why you would switch away from the former when it's so convenient. That's why the difficulty vs. complexity chart is so great, and may actually speak to people in an approachable way.
There are a lot of tutorials for how to do hard things and how to do easy things, but not a lot for how to think of the hard things in terms of the easy things, and this falls in that category. Another good book on this topic is Data Smart by John Foreman, where he goes over basic data science skills in Excel.
> This is a fantastic article for intermediate beginners. On HN, everyone is a senior data scientist working with Spark and Keras and Tensorflow and deep learning.
Although there is an ML/AI selection bias on HN, there are certainly a lot of people here who fall into the intermediate/beginner category (there is a lot of demand for R tutorials, which I have been working on), although I would argue that dplyr can legitimately be used at the advanced level. And certainly Keras/Tensorflow is overkill for common business problems.
There are definitely a lot of people who fall into those categories, but the articles/links give the impression that everyone is senior, which is why it's great when articles like this come around.