Really interesting. I tried it with an article I wrote for Wired - both to see how it handled lengthy content with multiple points, and also how it handled my 'loose' writing style.
Stremor's TLDR:
Pointless reeling off the numbers. The challenge each of us now faces is a brand new one. Our filters were once the media, our friends, and our families. We were aware of, and understood how our filters operated. EdgeRank isn't something that Facebook users understand. Just 20 tweets out of thousands. We need better filters.
Text Teaser:
How do we create a balanced diet of content with so much junk being thrown at us?
Now, large media organisations create mountains of content, then track our reading habits and online behaviour in order to build a profile of us.
A set of favourite news groups; a list of RSS feeds; a well-curated bookmarks folder; these are all filters we once built ourselves.
The less we understand our filters, the more we will come to accept that the world they present us with is true.
The more control we have over our filters, the more we can understand what we're not seeing.
TextTeaser's output goes over 350 characters, which is the established "people can't sue you for stealing it" limit... so also weigh that when deciding which you like better.
Yes, tried it on a couple of my own pages. Seems to work ok.
Who is going to be first to couple this library with an RSS feed reader and mailer so that I can get auto-generated summaries of recently written articles sent to my blackberry?
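Something like this would get most of the way there, I imagine. All of it is hypothetical glue code, and summarize() is just a placeholder for TextTeaser or any other summarizer:

    import feedparser

    def summarize(title, text):
        # placeholder: call TextTeaser (or any summarizer) here
        return text[:280]

    feed = feedparser.parse('http://example.com/feed.rss')
    for entry in feed.entries:
        print(entry.title)
        print(summarize(entry.title, entry.get('summary', '')))

Pipe the output into an email and the BlackBerry part is solved too.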
Personally, I find the "email me news" scheme obnoxious. I get enough emails as it is. I would prefer to see a portal that shows summaries of all the news and lets the user explore.
Once I was at the screen showing me the summary, in the right-hand 'Share' column I couldn't copy the text in the link, image, or embed fields. (FF 4.0 on OS X 10.8.5)
No, I believe he actually meant LSA = latent semantic analysis, which is an algorithm used to extract topics.
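For context, a minimal LSA sketch with scikit-learn; this is purely illustrative, and the toy documents are invented:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD

    docs = [
        "the cat sat on the mat",
        "dogs and cats make good pets",
        "stock markets fell sharply today",
        "investors sold shares as markets dropped",
    ]
    tfidf = TfidfVectorizer(stop_words='english').fit_transform(docs)
    topics = TruncatedSVD(n_components=2).fit_transform(tfidf)  # rows: docs in "topic" space

The low-rank projection is what groups the pet documents and the market documents into separate latent topics.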
I am also curious about how the NLP/ML parts are implemented, as claimed by the README on GitHub. I briefly scanned the code but didn't really spot them.
If you leave the "Title" field empty and click "Summarize", nothing happens -- which I thought was very confusing. You have to fill in something for the Title.
That explains the many tests I just did with random copy-and-pasted articles: I just typed gibberish into the title. I mean, not all texts need to have a title.
Maybe in the future I can improve it so the title isn't required. It may produce good results for other types of text, but right now TextTeaser is meant to be used for news articles.
Since the title (headline) of a news article usually summarizes it, TextTeaser arguably is less an article summarizer than a headline expander...
EDIT: it would be nice, from a UX POV, to request the title if it's missing, rather than silently discarding the story... also, you might emphasize its importance (because it doesn't seem important at all). Perhaps just labeling it as "headline" or "subject" instead of the generic "title" would help.
Looks like the algorithm is giving weight to h1 and h2 tags in the page markup having just tried it on some of my pages. Is that true or am I imagining it?
If so, I'll have to provide more literal subheadings!
This was also the foundation for Summly. The problem is that it is flawed: it doesn't take into account emotion or emphasis.
What is the most important sentence in this:
Drakaal is a poopy head. He often posts to hackernews and calls people an idiot. When this happens I get mad.
The premise is "Drakaal is a poopy head," so that is the most important sentence, but it doesn't inform the user. What he does is the most telling of the sentences, but with "He" as the first word you can't actually make sense of the content without the prior sentence. The last sentence is the least important for understanding the content.
It is important to know that sentence 2 is the most informative, but figuring out what it means requires sentence 1.
When computing the results of a summary you have to weigh sentence dependencies, density of information, amount of emotion expressed and number of characters available to you.
And keywords aren't enough; you need noun entities and the ability to tell the relationships of words, so that you know "Cars, Trucks, and Automobiles" are all the same concept in many contexts.
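To make that concrete, here is a toy sketch of the kind of weighting I mean. The pronoun list, the entity check, and the penalty factor are all invented for illustration; this is nobody's actual code:

    PRONOUNS = {'he', 'she', 'it', 'they', 'this', 'that'}

    def score_sentence(sentence, entities, chars_available):
        words = sentence.lower().replace('.', '').split()
        if not words or len(sentence) > chars_available:
            return 0.0
        # information density: fraction of words that are known noun entities
        density = sum(1 for w in words if w in entities) / float(len(words))
        # sentence dependency: a leading pronoun means it can't stand alone
        penalty = 0.5 if words[0] in PRONOUNS else 1.0
        return density * penalty

On the example above, the middle sentence carries the most information but gets penalized for the leading "He", which is exactly the tension a real summarizer has to resolve.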
You are spot on about the paper I referenced. As you can see in section 4.3 on page 3, the paper mentions two algorithms for sentence selection: Summation-Based Selection (SBS) and Density-Based Selection (DBS).
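For anyone without the paper handy, here is my rough reading of the two schemes sketched in Python. Treat it as an approximation of the idea, not TextTeaser's actual (Scala) code:

    def sbs(words, keyword_scores):
        # Summation-Based Selection: average keyword score over the sentence
        if not words:
            return 0.0
        return sum(keyword_scores.get(w, 0.0) for w in words) / float(len(words))

    def dbs(words, keyword_scores):
        # Density-Based Selection: reward keywords that occur close together
        hits = [(i, keyword_scores[w]) for i, w in enumerate(words) if w in keyword_scores]
        if not hits:
            return 0.0
        k = len(set(w for w in words if w in keyword_scores))  # distinct keywords
        total = sum((s1 * s2) / float((i2 - i1) ** 2)
                    for (i1, s1), (i2, s2) in zip(hits, hits[1:]))
        return total / (k * (k + 1.0))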
What made you pick this representation in particular?
I'm kind of curious what different kinds of algorithms you might have looked at.
Summarizing only blog posts seems a bit limiting to me. (Btw, I'm not trying to be negative; congrats on your success! TextTeaser looks great!)
I implemented a custom version of TextRank (mainly changed the scoring scheme to initialize scores with the TF/IDF of words) and loved it.
The main thing I liked about it was how general it is: sentences are the graph's nodes, and edges between them are weighted by word overlap. Then you basically use PageRank to rank the sentences according to the graph representation.
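Not my original code, but a minimal sketch of the plain version, using networkx for PageRank; the TF/IDF-initialized variant could seed the ranking through the personalization argument:

    import math
    import networkx as nx

    def overlap_similarity(s1, s2):
        # edge weight: word overlap, normalized by sentence lengths
        w1, w2 = set(s1.lower().split()), set(s2.lower().split())
        shared = len(w1 & w2)
        if not shared:
            return 0.0
        return shared / (math.log(len(w1) + 1) + math.log(len(w2) + 1))

    def textrank(sentences, n=3):
        g = nx.Graph()
        g.add_nodes_from(range(len(sentences)))
        for i in range(len(sentences)):
            for j in range(i + 1, len(sentences)):
                w = overlap_similarity(sentences[i], sentences[j])
                if w > 0:
                    g.add_edge(i, j, weight=w)
        ranks = nx.pagerank(g, weight='weight')
        best = sorted(ranks, key=ranks.get, reverse=True)[:n]
        return [sentences[i] for i in sorted(best)]  # keep document order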
Hi, I focus on blog posts because I don't want the scope to be too broad. TextTeaser is my research for my graduate studies, and broader research is harder to accomplish. But that doesn't mean it can't be used on other types of text; it can. It's just optimized for news.
I'm a little bit familiar with TextRank because I stumbled upon it while doing my research. I also read about several other algorithms but forgot what they are called.
Ahh very cool! Thank you for the insight. I could see where that would be applicable then. Using comments as features is a very neat concept.
News is the most broadly applicable use for this, so leveraging that isn't a bad thing. There's always a trade-off between broad applicability and overfitting to a particular case to get better results.
Hey guys, I made a userscript at HackMIT last weekend that adds article summaries to the HN front page. It doesn't use the TextTeaser API (for the time being, at least) but the summaries seem to come out about the same anyway.
From a quick test, it seems to treat almost every bit of content on a page equally, even elements which are clearly smaller and next to an image.
Might I recommend taking CSS styles into account? Large text is usually headlines, <strong> text is usually important, and darker greys generally suggest a side comment. Would be much easier if everybody used <aside> and <h1> but even in 2013 that's too high an expectation.
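A hypothetical sketch of that suggestion with BeautifulSoup; the tag weights are invented for illustration:

    from bs4 import BeautifulSoup

    TAG_WEIGHTS = {'h1': 3.0, 'h2': 2.0, 'strong': 1.5, 'aside': 0.3}

    def weighted_text_blocks(html):
        # yield (weight, text) pairs, boosting text inside "important" tags
        soup = BeautifulSoup(html, 'html.parser')
        for node in soup.find_all(text=True):
            text = node.strip()
            if not text:
                continue
            weight = 1.0
            for parent in node.parents:
                weight *= TAG_WEIGHTS.get(parent.name, 1.0)
            yield weight, text

The weights could then multiply whatever sentence score the summarizer already computes.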
You are right, I'm not taking account of HTML tags. That is because I extract the text beforehand using Python Goose, so only the plain text is fed into the algorithm, without any HTML tags.
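For anyone curious, the extraction step with python-goose looks roughly like this (the URL is a placeholder):

    from goose import Goose

    g = Goose()
    article = g.extract(url='http://example.com/some-article')
    print(article.title)         # what goes into the "title" field
    print(article.cleaned_text)  # plain text, HTML stripped -- what the algorithm sees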
Try https://github.com/visualrevenue/reporter :) I'm looking at your service now and it is really massively awesome. Can I ask if you are considering monetizing it, or going the venture path (boo)? I ask because I'm curious about the viability of using your service/library in a long-term project.
If you paste this thread into the demo you get not very encouraging results. I haven't looked at the code, but I suspect they find the sentences with the most (cosine) similarity to the title and bias towards early sentences -- see the sketch after the results below.
results:
- Hacker Newsnew | threads | comments | ask | jobs | submit hnriot (1618) | logout upvote TextTeaser – An automatic - summarization algorithm (github.com)
- If you leave the "Title" field empty and click "Summarize", nothing happens -- which I thought was very confusing.
- reply upvote downvote MojoJolo 1 hour ago | link I require the title because I need it for the algorithm.
- not a criticism of textteaser (which was behind this excellent project https://news.ycombinator.com/item?id=6498625),
- reply upvote downvote wikiburner 3 minutes ago | link Is this a well known text summarization tool?
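Here is the kind of scoring I am guessing at, sketched in Python; this is pure speculation, not TextTeaser's actual code:

    import math
    from collections import Counter

    def cosine(a, b):
        va, vb = Counter(a.lower().split()), Counter(b.lower().split())
        dot = sum(va[w] * vb[w] for w in va)
        norm = (math.sqrt(sum(c * c for c in va.values()))
                * math.sqrt(sum(c * c for c in vb.values())))
        return dot / norm if norm else 0.0

    def score_sentences(title, sentences):
        n = float(len(sentences))
        # similarity to the title, with a linear bias towards early sentences
        return [cosine(title, s) * (1.0 - 0.5 * i / n)
                for i, s in enumerate(sentences)]

That would explain both the title dependence and why the pasted page chrome near the top scored so well.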
Creates MUCH better summaries and comes with all the machinery to separate content from the web template.
If you contact Stremor there is also a version that scores every sentence for importance on a scale of 0-100 and maintains HTML so that you can return summaries of any length and still have images and other styling maintained.
You post non-idiomatic(?) Scala in a comment to explain what you are doing, I think? Not a criticism of TextTeaser (which was behind this excellent project https://news.ycombinator.com/item?id=6498625), but it seems to raise questions about the language...
That patch of code could be restructured a bit to make it more readable; here are a couple of small suggestions that jump out at me:
1. Perhaps `.reduceLeft(_ + _)` could be replaced with the built-in `sum` method (it does exist in Scala for numeric collections).
2. If the `topKeywords` collection returned a default value with a `.score` of 0 when queried with a key it doesn't contain, the `headOption getOrElse null` match on null would not be necessary.
e.g. in Python it might look something like this:

    from collections import namedtuple, defaultdict

    # only the score field is shown; add whatever other fields you track
    Keyword = namedtuple('Keyword', ['score'])
    top_keywords = defaultdict(lambda: Keyword(score=0))

    def sbs(words):
        if words:
            return (1.0 / len(words)) * sum(top_keywords[w].score for w in words)
        else:
            return 0.0
(apologies for making superficial comments about the code. the algorithm itself certainly seems interesting)
You are right. I think Scala is a good language and handles functional programming well. But the code is so abstract that even I might not get what it is doing. I placed that comment as a reminder for myself, and also so everyone else can easily get what that piece of code is doing.
Curiously, I find it easier to read the Scala code than the commented pseudocode. I've been experiencing this a lot lately. It seems I am losing my ability to reason about code that loops explicitly.
One minor nitpick that can be of help when dealing with tuples: a partial function ({ case xxx => yyy }) is a Function1, so you can use it with map and filter. This way you can deconstruct tuples into names and avoid using _1, _2, etc.: { case (name, value) => blah }
I've been interested in how it works since I first saw it! Can't wait for the documentation. Though I think I'm going to learn Scala just to read through this. Thanks for putting it up!
Hi! I will still retain the API on Mashape. That is for developers who do not want the hassle of deploying it on their own servers. On the other hand, the open source code is for devs to check out the algo and, hopefully, improve and contribute to TextTeaser. If they want to use it and deploy it on their own, they are free to do so. :)
Really wish there was a way I could test the API without giving my CC info to Mashape. Even for the Freemium plan, I can't do a single request without giving payment info. Thus, I'm skipping this API, despite how cool it looks.
Edit: the main TextTeaser web site is down right now, which is why I went straight to the API to test.
What is the structure of the sent.model file inside the corpusEN.bin zip archive?
It's a strikingly small file for something called corpus. Say I have a larger corpus, or a corpus in a different language, how would I go about building one of these sent.model files with more data?
The corpusEN.bin file is the training data provided by OpenNLP which I used to split sentences (http://opennlp.sourceforge.net/models-1.5/). It's not the training data used for summarization.
Can you provide a bit more detail about your approach? Are you using machine learning or just simple scoring based on some heuristics? From the look of the source code it seems to be the latter.
It's mostly statistical (simple scoring). But as you can see in these lines of code: https://github.com/MojoJolo/textteaser/blob/master/src/main/... I keep track of the keywords previously used by each blog and category. Through that, TextTeaser employs a little bit of machine learning to improve the quality of the results.
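My loose reading of that description, sketched in Python (not the actual Scala code at the link above):

    import math
    from collections import defaultdict, Counter

    category_keywords = defaultdict(Counter)  # category -> keyword counts seen so far

    def record(category, keywords):
        # remember which keywords this blog/category produced before
        category_keywords[category].update(keywords)

    def adjusted_score(category, word, base_score):
        # boost words that were historically frequent in this category
        seen = category_keywords[category][word]
        return base_score * (1.0 + math.log(1 + seen))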
NLTK's strength is the clarity and flexibility of its code, for when you're experimenting with various processes and representations to find out what works.
If you have a single NLP model that already works, you wouldn't gain anything from rewriting it using NLTK. It would probably just get slower, because you're adding abstractions that you've already shown you don't need.
I say this as a fan of and (once) contributor to NLTK.
You can see from the source he's already using Apache OpenNLP. Scala is 100% interoperable with Java libs, so you have the entire Java ecosystem available, not just Scala code.
I created a news reader for Philippine news (http://www.readborg.com/) using TextTeaser. The word "Philippine" appeared most of the time, so I decided to make it a stop word. I forgot to remove it from the stop word list afterwards.
Really surprised with both the quality and succinctness of the result: http://www.textteaser.com/s/t1bNud
Well done.
(Also to the project owner - copying the link is borked in Firefox, I had to type it out manually)