Summarizing news articles the way the bot does isn't really that
difficult. You face the task of picking 5 sentences out of 20. Since journalists
write in a very compact style, the result will almost always
look decent. You just score each sentence by a few simple rules like:
- Length of sentence without stopwords
- Distribution of words across the article
- Words shared with the title
- Position in the article
- Position in the paragraph
- Does the sentence address the subject in third person? (avoid)
- Does the sentence contain direct speech? (avoid)
Stuff like this. It's a bit of a Mechanical Turk, really.
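To make that concrete, here's a toy scorer in Python. It implements the features from the list above except paragraph position (the input here is just a flat list of sentences); the stopword set and all the weights are numbers I made up for illustration, not anything taken from the actual bot:

```python
import re
from collections import Counter

# Tiny stopword list for illustration; a real one would be much larger.
STOPWORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "was",
             "that", "on", "for", "with", "as", "at", "by"}
THIRD_PERSON = {"he", "she", "it", "they", "him", "her", "them"}

def content_words(text):
    """Lowercased word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower())
            if w not in STOPWORDS]

def score_sentences(sentences, title):
    """Score each sentence with the heuristics above; weights are invented."""
    # In how many sentences does each content word appear?
    doc_freq = Counter(w for s in sentences for w in set(content_words(s)))
    title_words = set(content_words(title))
    scores = []
    for i, sent in enumerate(sentences):
        words = content_words(sent)
        wset = set(words)
        if not words:
            scores.append(0.0)
            continue
        # Length without stopwords, capped so long sentences don't dominate.
        length = min(len(words), 25) / 25
        # Words shared with the title.
        title_overlap = len(title_words & wset) / max(len(title_words), 1)
        # How widely the sentence's words are distributed across the article.
        spread = sum(doc_freq[w] for w in wset) / (len(wset) * len(sentences))
        # Earlier positions matter more in news writing.
        position = 1.0 - i / len(sentences)
        score = 0.3 * length + 0.3 * title_overlap + 0.2 * spread + 0.2 * position
        # Avoid direct speech and sentences opening with a dangling pronoun.
        tokens = re.findall(r"[a-z']+", sent.lower())
        if '"' in sent:
            score *= 0.5
        if tokens and tokens[0] in THIRD_PERSON:
            score *= 0.5
        scores.append(score)
    return scores

def summarize(sentences, title, k=5):
    """Pick the k highest-scoring sentences, then restore article order."""
    scores = score_sentences(sentences, title)
    top = sorted(range(len(sentences)), key=scores.__getitem__, reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```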
The reddit bot, then, computes some simple but carefully tuned statistics and extracts whole sentences verbatim.
Berkeley's summarizer takes the syntax trees of sentences and cuts unimportant pieces directly out of them: unimportant adjectives, bracketed asides, comma-separated explanations, and so on. Since the operations are performed on syntax trees, the shortened sentence remains grammatically correct, as long as the parse itself is correct.
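Here's a rough sketch of what that deletion step looks like on a constituency parse, using nltk's Tree. The example sentence and the hard-coded set of droppable labels are my assumptions; the real system decides which cuts are safe with a trained model:

```python
from nltk.tree import Tree

# Constituent labels this sketch treats as droppable: attributive
# adjectives, adjective phrases, and parentheticals. This fixed set is
# an assumption; the actual summarizer learns which cuts are safe.
PRUNE_LABELS = {"JJ", "ADJP", "PRN"}

def prune(tree):
    """Copy the parse tree, dropping droppable subtrees. Because whole
    constituents are removed, the remaining leaves still form a
    grammatical sentence (assuming the parse itself was correct)."""
    kept = [child if isinstance(child, str) else prune(child)
            for child in tree
            if isinstance(child, str) or child.label() not in PRUNE_LABELS]
    return Tree(tree.label(), kept)

parse = Tree.fromstring(
    "(S (NP (DT The) (JJ controversial) (NN bill))"
    " (VP (VBD passed)"
    "  (PRN (, ,) (PP (IN after) (NP (JJ heated) (NN debate))) (, ,)))"
    " (. .))")
print(" ".join(prune(parse).leaves()))  # -> The bill passed .
```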
It also uses coreference resolution. For example, if you took the previous sentence out of context, you wouldn't know what or who "it" is. The system builds up a map of named entities ("it" = the Berkeley Document Summarizer, in this case) and replaces pronouns, where necessary, with the names they refer to.
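A minimal sketch of just the substitution step, assuming the coreference clusters (pronoun positions grouped with a canonical name) have already been produced by an upstream model; the data format and example are mine:

```python
def resolve(tokens, clusters):
    """clusters: list of (canonical_name, indices_of_pronoun_tokens).
    Replace each listed pronoun token with the entity's name."""
    out = list(tokens)
    for name, indices in clusters:
        for i in indices:
            out[i] = name
    return out

tokens = "It also uses coreference resolution .".split()
clusters = [("The Berkeley Document Summarizer", [0])]
print(" ".join(resolve(tokens, clusters)))
# -> The Berkeley Document Summarizer also uses coreference resolution .
```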
The summarization quality should be much better than the reddit bot's.