
I wonder how it compares to the reddit summarizer bot, which is absolutely amazing. It is frequently the most upvoted comment on long news articles.



Summarizing news articles the way the bot does isn't really that difficult. You face the task of picking 5 sentences out of 20. Since journalists write in a very compact style, the result will almost always look decent. You just score each sentence on a few simple features like:

  - Length of sentence without stopwords
  - Distribution of words across the article
  - Words shared with the title
  - Position in the article
  - Position in the paragraph
  - Does the sentence address the subject in third person? (avoid)
  - Does the sentence contain direct speech? (avoid)
Stuff like this. It's a bit of a Mechanical Turk, really.
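
A minimal sketch of that kind of scoring, to make it concrete. This is not the bot's actual code: the stopword list, the equal weighting of features, and the names score_sentence and summarize are all assumptions made up for illustration, and only a few of the features from the list are covered (word distribution, title overlap, position, direct speech).

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not the bot's list).
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "it",
             "that", "on", "for", "was", "with", "as", "by", "at"}


def words(text):
    """Lowercased content words of a text, stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower())
            if w not in STOPWORDS]


def score_sentence(sentence, index, total, title_words, word_freq):
    """Combine a few of the features above with arbitrary equal weights."""
    content = words(sentence)
    if not content:
        return 0.0
    # Distribution of words across the article: mean article-wide frequency.
    freq = sum(word_freq[w] for w in content) / len(content)
    # Words shared with the title.
    overlap = len(set(content) & title_words)
    # Position in the article: earlier sentences score higher.
    position = 1.0 - index / total
    # Direct speech (avoid): crude check for a double quote.
    quote_penalty = 1.0 if '"' in sentence else 0.0
    return freq + overlap + position - quote_penalty


def summarize(title, sentences, k=5):
    """Return the k best-scoring sentences in their original order."""
    word_freq = Counter(w for s in sentences for w in words(s))
    title_words = set(words(title))
    scored = [(score_sentence(s, i, len(sentences), title_words, word_freq), i)
              for i, s in enumerate(sentences)]
    keep = sorted(i for _, i in sorted(scored, reverse=True)[:k])
    return [sentences[i] for i in keep]
```

Call it as summarize(title, list_of_sentences, k=5). Splitting the article into sentences is left out here, and a real version would need tuned weights rather than a plain sum.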


The reddit bot computes some simple (if fiddly) statistics to pick whole sentences for extraction.

Berkeley's summarizer takes the syntax trees of sentences and cuts unimportant adjectives (or these unimportant brackets, or comma-separated explanations, etc.) directly out of the sentences. Since the operations are done on syntax trees, the shortened sentence is grammatically correct, provided the parsed syntax tree is correct.
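
A toy sketch of the tree-pruning idea, not the Berkeley system's method: the tree is written by hand instead of coming from a parser, the (label, children) tuple format is made up for this example, and "always drop JJ and PRN nodes" is a hard-coded stand-in for whatever decides which cuts are safe in the real thing.

```python
# Labels to cut: adjectives (JJ) and parentheticals (PRN).
# Hard-coding them is an assumption made for illustration.
PRUNABLE = {"JJ", "PRN"}


def compress(node):
    """Return the words of a tree with every prunable subtree removed.

    A node is (label, word) for a leaf or (label, [children]) otherwise.
    """
    label, children = node
    if isinstance(children, str):
        return [children]
    kept = []
    for child in children:
        if child[0] in PRUNABLE:
            continue  # drop the whole subtree, not just one word
        kept.extend(compress(child))
    return kept


# Hand-built parse of "The old summarizer ( unmaintained ) extracts whole sentences".
tree = ("S", [
    ("NP", [("DT", "The"), ("JJ", "old"), ("NN", "summarizer")]),
    ("PRN", [("-LRB-", "("), ("JJ", "unmaintained"), ("-RRB-", ")")]),
    ("VP", [("VBZ", "extracts"),
            ("NP", [("JJ", "whole"), ("NNS", "sentences")])]),
])

print(" ".join(compress(tree)))  # -> The summarizer extracts sentences
```

Because whole subtrees are removed, the brackets go away together with their contents, which is why the result stays well-formed as long as the parse was right.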

It also uses coreference resolution. For example, if you took the last sentence without context, you wouldn't know what or who "it" is. They build up a map of named entities [name - Berkeley Document Summarizer] and replace pronouns, where necessary, with the names they refer to.
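
A rough sketch of just the replacement step, assuming the hard part (working out which pronoun refers to which entity) is already done and handed in as a map. The function name rewrite_pronouns, the (sentence index, token index) key format, and the "replace only if the name hasn't appeared in the summary yet" rule are assumptions for illustration, not taken from the Berkeley code.

```python
def rewrite_pronouns(sentences, antecedents):
    """Replace pronouns with entity names where the summary needs it.

    sentences:   list of token lists, one per extracted sentence
    antecedents: {(sentence_idx, token_idx): entity name}, assumed to
                 come from a separate coreference resolver
    """
    out = []
    for s, tokens in enumerate(sentences):
        new_tokens = []
        for t, tok in enumerate(tokens):
            name = antecedents.get((s, t))
            # Everything already emitted, including the current sentence so far.
            context = " ".join(out + [" ".join(new_tokens)])
            if name is not None and name not in context:
                new_tokens.append(name)  # reader has not seen the name yet
            else:
                new_tokens.append(tok)   # name already introduced, keep pronoun
        out.append(" ".join(new_tokens))
    return out


summary = [
    ["It", "also", "uses", "coreference", "resolution", "."],
    ["It", "also", "compresses", "sentences", "."],
]
entity = "The Berkeley Document Summarizer"
print(rewrite_pronouns(summary, {(0, 0): entity, (1, 0): entity}))
```

Only the first "It" gets replaced here; the second one is left alone because the name is already in the summary by then.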

Its summary quality should be much better than the reddit bot's.



