The Berkeley Document Summarizer: Learning-Based, Single-Document Summarization (github.com/gregdurrett)
195 points by fitzwatermellow on Oct 2, 2016 | 32 comments



Anyone else work with NLP but unlikely to try this tool simply because it's in Java?

Do any Python NLP engineers integrate Java tools in their setup, e.g. via Jython? How does that work for you?


We work with NLP and Java. Stanford NLP and Apache Spark both run on Java/Scala, and so does Hadoop for HDFS. Why the hate against Java?


Startup time and memory footprint?


It's HDFS, so it's distributed across machines. Startup time is not a problem for us, since the service seldom shuts off.


I work with NLP and do not plan to try this tool, though, not because of Java, but because I'm not interested in a summarizer right now.

Our current production back-end doing NLP stuff is written in Java, and most third-party libraries it uses are also written in Java. It was initially written in Python, but at one point we realized that most of the libraries we use are in Java and Python was just moving data between them. The choice of these Java libraries wasn't driven by any love for Java either. At the time, they were just a fair bit more advanced, both in feature set and performance, than their Python counterparts. One example is the Stanford Parser and the CoreNLP toolkit: up until Parsey McParseface, the former was the most accurate parser, and CoreNLP had more features (that interested us) than Python's NLTK.


In the end, language and libs don't matter as much as the actual relevance of the algorithmic methods. This is where the real innovation and invention occurs: in the ability to mimic human cognition as closely as possible. This is also why side-by-side comparisons are the ultimate litmus test, and why there is now a separation happening between companies that regurgitate voyeuristic ideas (Geocities, Friendster, MySpace, Facebook-AOL, Snapchat, Twitter) versus companies that truly invent and innovate hard-to-duplicate initial algorithmic solutions (Google, SENS, Buck Inst., Human Longevity, SpaceX). It's like fake AI vs. a path toward some semblance of real AI.


I do a lot of NLP stuff, and the Python/Java divide is definitely a problem right now. I'm happy with both languages, but generally prefer Python. A lot of the state-of-the-art tools, especially the Stanford stuff, are in Java, while much of the more modern, experimental stuff is in Python. In theory it should be really easy to use both, but in practice I've always had to deal with bugs in the gap between the languages, whether with Jython or OS-level integration. So currently, yes, there are some new tools (including this one) that I would definitely try if they were written in Python and I could just do a pip install + import, but that I never get round to testing because, like this one, they are clunkier to set up and require Java.
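For what it's worth, the least fragile form of the OS-level integration mentioned above, in my experience, is just shelling out to the Java tool and piping text through it. A minimal sketch (the `tr` command here is only a stand-in; the actual `java -jar tool.jar` invocation is hypothetical):

```python
import subprocess

def run_tool(cmd, text):
    """Pipe text through an external CLI tool and return its stdout."""
    proc = subprocess.run(cmd, input=text, capture_output=True,
                          text=True, check=True)
    return proc.stdout

# Stand-in command; a Java tool would look like ["java", "-jar", "tool.jar"].
print(run_tool(["tr", "a-z", "A-Z"], "hello nlp"))  # prints HELLO NLP
```

It avoids Jython entirely, at the cost of JVM startup time per call, so for batch workloads you'd typically keep the Java process alive and stream documents through it instead.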


Would be nice to work with this in JRuby.


Well, this code is in Scala, not Java. Python doesn't have a library like the Stanford datetime parser that can convert a phrase such as "day after tomorrow" to the actual date. I understand that writing Java is hard simply because of the amount of code you have to write. Scala is a good compromise.


On the contrary, last time I needed to parse natural dates in Python I got confused because there are dozens of libraries that do exactly this. One example:

    >>> from parsedatetime import Calendar
    >>> cal = Calendar()
    >>> cal.parse("The day after tomorrow")
    (time.struct_time(tm_year=2016, tm_mon=10, tm_mday=3, tm_hour=9, tm_min=0, tm_sec=0, tm_wday=0, tm_yday=277, tm_isdst=-1), 1)
Yes, I've used Java equivalents before and they're impressive, but I don't think Python is inferior in this regard. For me, the advantage of Python is mainly being able to easily run small sections of project code in an IPython shell while developing, which makes the development workflow so much faster and more pleasant, rather than the quantity of code.


Thanks for pointing this out. We missed this during our comparison studies.


Looks like the output I pasted is actually wrong though so maybe I spoke too soon.


Would be cool if projects like these could show some examples of what they are capable of... what results I should expect.


From the linked page:

See http://www.eecs.berkeley.edu/~gdurrett/ for papers and BibTeX.


I wonder how it compares to the reddit summarizer bot, which is absolutely amazing. It's frequently the most upvoted comment on long news articles.


Summarizing news articles the way the bot does isn't really that difficult. You face the task of picking 5 sentences out of 20. Since journalists write in a very compact style, the result will almost always look decent. You just follow a few simple rules like:

  - Length of sentence without stopwords
  - Distribution of words across the article
  - Words shared with the title
  - Position in the article
  - Position in the paragraph
  - Does the sentence address the subject in third person? (avoid)
  - Does the sentence contain direct speech? (avoid)
Stuff like this. It's a bit of a Mechanical Turk, really.


The reddit bot uses some fairly simple statistics to extract whole sentences.

Berkeley's summarizer takes the syntax trees of sentences and cuts unimportant adjectives (or unimportant bracketed asides, comma-separated explanations, etc.) directly out of the sentences. Since the operations are done on syntax trees, the shortened sentence is grammatically correct, provided the parsed syntax tree is correct.

It also uses coreference resolution. (For example, if you were to take the previous sentence without context, you wouldn't know what/who "it" is. They build up a map of named entities [name: Berkeley Document Summarizer] and replace pronouns, if necessary, with the names to which they refer.)

The summarization quality should be much better than the reddit bot's.
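To make the tree-pruning idea concrete, here's a toy sketch of deleting subtrees from a parse (entirely my own illustration; the Berkeley system learns which subtrees are safe to delete rather than using a fixed label list). Leaves are `(tag, word)` pairs and phrases are `(label, [children])`:

```python
def compress(node, drop=frozenset({"JJ", "ADJP", "PRN"})):
    """Remove subtrees whose label is in `drop`; return None if nothing survives."""
    label, rest = node
    if isinstance(rest, str):                  # leaf: (tag, word)
        return None if label in drop else node
    kids = [c for c in (compress(k, drop) for k in rest) if c is not None]
    return (label, kids) if kids else None

def words(node):
    """Read the sentence back off the (possibly pruned) tree."""
    label, rest = node
    return [rest] if isinstance(rest, str) else [w for c in rest for w in words(c)]

tree = ("S", [("NP", [("JJ", "unimportant"), ("NNS", "adjectives")]),
              ("VP", [("VBP", "clutter"), ("NP", [("NNS", "summaries")])])])
print(" ".join(words(compress(tree))))  # prints: adjectives clutter summaries
```

Because whole constituents are removed rather than individual words, the remaining leaves still form a grammatical sentence, which is the point the parent comment makes about operating on trees.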


How does it compare with the summarizer in Microsoft Word (removed around 2010)?


I sometimes tell people about this (e.g. students who need to shorten their papers) and nobody believes this was a thing. Of course, the results were pretty poor back then. I guess there should be better tools by now. Side-by-side comparison anyone?


I personally like context-controllable summarization http://www.lexcognition.com/summarai/


So how does this compare to the stuff Google was doing with document summarization? Is the content unique, meaning does it summarize using brand new words?

It's unclear, but it still seems really promising.


Just going by the GitHub readme and the corresponding paper, this tool does not generate new words, which is popularly known as abstractive summarization. It does an extraction task followed by a syntactic compression task.

Pages 3-4 of the paper: http://www.cs.utexas.edu/~gdurrett/papers/durrett-berg-klein...


Google has been too busy trying to lift methods from Berkeley Lab and pass them off as its own, in particular Tomas Mikolov: https://www.kaggle.com/c/word2vec-nlp-tutorial/forums/t/1234...


You need a better argument than "somebody else was working on word vectors before Google."

No shit.


That's not what was said. It's how the feature attributes in the vectors are constructed, scored, and ranked, in addition to the calculations used to score vectors for similarity.


I like this approach; it's interesting to see coherence factoring into the summarization, amongst other things.


Could HN run this automatically on every post? That would be cool :)


Does anyone have an example of a summary done with this library?


If you look at the paper it has one annotated: http://www.cs.utexas.edu/~gdurrett/papers/durrett-berg-klein...


I don't think it does? Figure 4 is an example of a manually written summary in the dataset.

There are some examples of how the sentence compression works, but no complete automatic summaries that I can see.



Hey, this is the guy whose dissertation talk I went to several months ago, which is on the same topic!



