I work with NLP but don't plan to try this tool, not because of Java, but because I'm not interested in a summarizer right now.
Our current production back-end for NLP is written in Java, and most of the third-party libraries it uses are also written in Java. It was initially written in Python, but at some point we realized that most of the libraries we use are in Java and Python was just moving data between them. The choice of those Java libraries wasn't driven by any love for Java either: at the time they were simply a fair bit more advanced, in both feature set and performance, than their Python counterparts. One example is the Stanford Parser and CoreNLP toolkit: up until Parsey McParseface it was the most accurate parser available, and CoreNLP had more features (that interested us) than Python's NLTK.
In the end, language and libs don't matter as much as the actual relevance of the algorithmic methods. That is where the real innovation and invention occur: in the ability to mimic human cognition as closely as possible. It's also why side-by-side comparisons are the ultimate litmus test, and why there is now a separation happening between companies that regurgitate voyeuristic ideas (geocities, friendster, myspace, facebook-aol, snapchat, twitter-majordomo) versus companies that truly invent and innovate hard-to-duplicate algorithmic solutions (Google, SENS, Buck Inst., Human Longevity, SpaceX). It's like fake AI versus a path toward some semblance of real AI.
I do a lot of NLP work and the Python/Java divide is definitely a problem right now. I'm happy with both languages, but generally prefer Python. A lot of the state-of-the-art tools, especially the Stanford stuff, are in Java, while much of the more modern and experimental work is in Python. In theory it should be really easy to use both, but in practice I've always had to deal with bugs in the gap between the languages, whether via Jython or OS-level integration. So yes, there are new tools (including this one) that I would definitely try if they were written in Python and I could just pip install and import them, but that I never get around to testing because, like this one, they are clunkier to set up and require Java.
Well, this code is in Scala, not Java. Python doesn't have a library like the Stanford datetime parser that can convert a phrase such as 'day after tomorrow' to an actual date. I understand that writing Java is tedious simply because of the amount of code you have to write; Scala is a good compromise.
On the contrary, the last time I needed to parse natural-language dates in Python I got confused because there are dozens of libraries that do exactly this. One example:
>>> from parsedatetime import Calendar
>>> cal = Calendar()
>>> cal.parse("The day after tomorrow")
(time.struct_time(tm_year=2016, tm_mon=10, tm_mday=3, tm_hour=9, tm_min=0, tm_sec=0, tm_wday=0, tm_yday=277, tm_isdst=-1), 1)
Yes, I've used the Java equivalents before and they're impressive, but I don't think Python is inferior in this regard. For me the advantage of Python isn't the quantity of code; it's mainly being able to easily run small sections of project code in an IPython shell while developing, which makes the workflow so much faster and more pleasant.
Summarizing news articles the way the bot does isn't really that difficult. You face the task of picking 5 sentences out of 20, and since journalists write in a very compact style, the result will almost always look decent. You just follow a few simple rules like:
- Length of sentence without stopwords
- Distribution of words across the article
- Words shared with the title
- Position in the article
- Position in the paragraph
- Does the sentence address the subject in third person? (avoid)
- Does the sentence contain direct speech? (avoid)
Stuff like this. It's a bit of a Mechanical Turk, really.
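To make that concrete, here is a minimal sketch of how a few of those rules can be combined into a sentence score. This is not the bot's actual code; the stopword list, weights, and cutoffs are all made up for illustration:

>>> import re
>>> from collections import Counter
>>>
>>> STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is", "was", "it", "that"}
>>>
>>> def tokens(text):
...     return [w.lower() for w in re.findall(r"[A-Za-z']+", text)]
...
>>> def score_sentence(sentence, position, total, title_words, doc_counts):
...     words = [w for w in tokens(sentence) if w not in STOPWORDS]
...     if not words:
...         return 0.0
...     length_score = min(len(words), 25) / 25.0            # length without stopwords, capped
...     title_score = len(set(words) & title_words) / (len(title_words) or 1)
...     # reward sentences whose words are frequent across the whole article
...     coverage_score = sum(doc_counts[w] for w in words) / (len(words) * max(doc_counts.values()))
...     position_score = 1.0 - position / total               # earlier sentences score higher
...     penalty = 0.5 if '"' in sentence else 1.0             # crude direct-speech penalty
...     return penalty * (length_score + title_score + coverage_score + position_score)
...
>>> def summarize(title, sentences, k=5):
...     title_words = set(tokens(title)) - STOPWORDS
...     doc_counts = Counter(w for s in sentences for w in tokens(s) if w not in STOPWORDS)
...     ranked = sorted(range(len(sentences)),
...                     key=lambda i: score_sentence(sentences[i], i, len(sentences),
...                                                  title_words, doc_counts),
...                     reverse=True)
...     return [sentences[i] for i in sorted(ranked[:k])]     # keep original article order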
The reddit bot relies on some simple but intricate statistics for whole-sentence extraction.
Berkeley's summarizer takes the syntax trees of sentences and cuts unimportant adjectives (or unimportant bracketed asides, comma-separated explanations, etc.) directly out of the sentences. Since the operations are done on syntax trees, the shortened sentence remains grammatically correct, provided the parsed syntax tree is correct.
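A toy illustration of the idea, using NLTK's Tree class. The real Berkeley system learns which constituents are safe to delete; this sketch just hard-codes a few labels as droppable:

>>> from nltk.tree import Tree
>>>
>>> # Constituent labels we pretend are always safe to drop (assumption for illustration)
>>> DROPPABLE = {"JJ", "ADJP", "PRN"}
>>>
>>> def compress(node):
...     if isinstance(node, str):          # leaf token
...         return node
...     kept = []
...     for child in node:
...         if not isinstance(child, str) and child.label() in DROPPABLE:
...             continue                   # prune the whole constituent
...         kept.append(compress(child))
...     return Tree(node.label(), kept)
...
>>> parse = Tree.fromstring(
...     "(S (NP (DT the) (JJ long-awaited) (NN report)) (VP (VBD arrived) (NP (NN yesterday))))")
>>> " ".join(compress(parse).leaves())
'the report arrived yesterday'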
It also uses coreference resolution. For example, if you took the previous sentence out of context you wouldn't know what or who "it" is; the system builds up a map of named entities (here, the Berkeley Document Summarizer) and, where necessary, replaces pronouns with the names they refer to.
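A toy illustration of the substitution step only; resolving which mention a pronoun actually points to is the hard part and is done by a trained coreference model, not a lookup table like this:

>>> # Assumed, pre-resolved pronoun-to-entity map (illustration only)
>>> resolved_mentions = {"it": "the Berkeley Document Summarizer"}
>>>
>>> def substitute(sentence, mentions):
...     out = []
...     for token in sentence.split():
...         key = token.lower().strip(".,")
...         out.append(mentions.get(key, token))
...     return " ".join(out)
...
>>> substitute("It also uses coreference resolution.", resolved_mentions)
'the Berkeley Document Summarizer also uses coreference resolution.'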
Summarization quality should be much better than the reddit bot's.
I sometimes tell people about this (e.g. students who need to shorten their papers) and nobody believes it was a thing. Of course, the results were pretty poor back then. I'd guess there are better tools by now. Side-by-side comparison, anyone?
So how does this compare to the stuff Google was doing with document summarization? Is the content unique, meaning does it summarize using brand new words?
Just going by the GitHub readme and the corresponding paper, this tool does not generate new words (what is commonly called abstractive summarization). It performs an extraction step followed by a syntactic compression step.
That's not what was said. It's about how the feature attributes in the vectors are constructed, scored, and ranked, in addition to the calculations used to score vectors for similarity.
Do any Python NLP engineers integrate Java tools in their setup, e.g. via Jython? How does that work for you?