TextTeaser – An automatic summarization algorithm (github.com/mojojolo)
188 points by MojoJolo on Oct 12, 2013 | 78 comments



Really interesting. I tried it with an article I wrote for Wired - both to see how it handled lengthy content with multiple points, and also how it handled my 'loose' writing style.

Really surprised with both the quality and succinctness of the result: http://www.textteaser.com/s/t1bNud

Well done.

(Also to the project owner - copying the link is borked in Firefox, I had to type it out manually)


A comparison of summaries of http://www.wired.co.uk/news/archive/2013-08/22/filtering-the...

Stremor's TLDR: Pointless reeling off the numbers. The challenge each of us now faces is a brand new one. Our filters were once the media, our friends, and our families. We were aware of, and understood how our filters operated. EdgeRank isn't something that Facebook users understand. Just 20 tweets out of thousands. We need better filters.

Text Teaser: How do we create a balanced diet of content with so much junk being thrown at us? Now, large media organisations create mountains of content, then track our reading habits and online behaviour in order to build a profile of us. A set of favourite news groups; a list of RSS feeds; a well-curated bookmarks folder; these are all filters we once built ourselves. The less we understand our filters, the more we will come to accept that the world they present us with is true. The more control we have over our filters, the more we can understand what we're not seeing.

Text Teaser goes over 350 characters, which is the established "people can't sue you for stealing it" limit... so also weigh that when deciding which you like better.


Yes, tried it on a couple of my own pages. Seems to work ok.

Who is going to be first to couple this library with an RSS feed reader and mailer so that I can get auto-generated summaries of recently written articles sent to my blackberry?


bitofnews.com already did it

Personally, I find the "email me news" scheme obnoxious. I get enough emails as it is. I'd prefer to see a portal that shows summaries of all the news and lets the user explore.


bitofnews.com is interesting but I want to be able to specify which sites to summarise.


I browsed the article and wanted to look at what else you have written but didn't find any references in the URL: http://www.wired.co.uk/news/archive/2013-08/22/filtering-the.... Not even your name. Is that normal for Wired?


Thanks!

But I did not get what you mean by "copying the link is borked in Firefox". What link are you talking about? :)


Once I was at the screen showing me the summary, in the right hand 'Share' column - I can't copy the text in the link, image, or embed fields. (FF4.0 on OSX 10.8.5)


Thanks. Will check it out. :) Never tried it on Firefox.


I really hope you mean Firefox 24, not 4.


If you guys want to try out TextTeaser, you can check out the website (http://www.textteaser.com/). Or try the API via Mashape (https://www.mashape.com/mojojolo/textteaser).


Are you using LSA?


What do you mean by LSA? Are you referring to this: http://en.wikipedia.org/wiki/Latent_semantic_analysis

If it is, I'm not using it. :)


You mean do you support SLA?


No, I believe he actually meant LSA = latent semantic analysis, which is an algorithm used to extract topics.

I am also curious about how the NLP/ML parts are implemented, as claimed by the README on GitHub. I briefly scanned the code but didn't really spot them.


It's mostly statistical NLP and a bit of machine learning. The algorithm can be found here: https://github.com/MojoJolo/textteaser/blob/master/src/main/...
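
To make "statistical scoring" concrete, here is a toy sketch in Python. The particular features (title overlap, sentence position, keyword weight) and the equal weighting are illustrative assumptions only, not a translation of the Scala code linked above:

    import re

    def score_sentences(title, sentences, top_keywords):
        """Toy extractive scorer combining a few common statistical features."""
        title_words = set(re.findall(r"\w+", title.lower()))
        scored = []
        for position, sentence in enumerate(sentences):
            words = re.findall(r"\w+", sentence.lower())
            if not words:
                continue
            # overlap with the (required) title
            title_feature = len(title_words & set(words)) / max(len(title_words), 1)
            # earlier sentences tend to matter more in news articles
            position_feature = 1.0 - position / len(sentences)
            # average keyword weight of the words in the sentence
            keyword_feature = sum(top_keywords.get(w, 0.0) for w in words) / len(words)
            total = (title_feature + position_feature + keyword_feature) / 3.0
            scored.append((total, sentence))
        return sorted(scored, reverse=True)

The summary is then just the top few sentences from that ranking, put back in document order.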


You can cut and paste some text here to try it out:

http://www.textteaser.com

If you leave the "Title" field empty and click "Summarize", nothing happens -- which I thought was very confusing. You have to fill in something for the Title.


I require the title because I need it for the algorithm.


That explains the many tests I just did with random copy-and-pasted articles. I just typed gibberish into the title. I mean, not all texts need to have a title.


Maybe in the future I can improve it so it doesn't require the title. It may produce good results for other types of text, but right now, TextTeaser is meant to be used for news articles.


Since the title (headline) of a news article usually summarizes it, TextTeaser arguably is less an article summarizer than a headline expander...

EDIT: it would be nice, from a UX POV, to request the title if it's missing, rather than silently deleting the story... also, you might emphasize its importance (because it doesn't seem important at all). Perhaps just labeling it as "headline" or "subject" instead of the generic "title" would help.


Congratulations on open-sourcing the library. Do you think it could generate a title as a one-sentence summarization?


Having just tried it on some of my pages, it looks like the algorithm gives weight to h1 and h2 tags in the page markup. Is that true, or am I imagining it?

If so, I'll have to provide more literal subheadings!


Nope. :) You are imagining it.


Fair enough. I must have used more relevant subheadings than I thought!


This sure looks interesting. What are the theoretical foundations of this? As for SBS I found this paper: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.222... . I couldn't find anything relevant on DBS, though.

At a cursory glance the algorithm seems like a variation of Luhn's abstract algorithm.


This was also the foundation for Summly. The problem is that it is flawed: it doesn't take into account emotion or emphasis.

What is the most important sentence in this:

Drakaal is a poopy head. He often posts to hackernews and calls people an idiot. When this happens I get mad.

The premise is "Drakaal is a poopy head", so that is the most important sentence, but it doesn't inform the user. What he does is the most telling of the sentences, but with "He" as the first word you can't actually make sense of the content without the prior sentence. The last sentence is the least important for understanding the content.

It is important to know that sentence 2 is the most informative, but that figuring out what it means requires sentence 1.

When computing the results of a summary you have to weigh sentence dependencies, density of information, amount of emotion expressed and number of characters available to you.

And keywords aren't enough: you need noun entities and the ability to tell the relationships between words, so that you know "Cars, Trucks, and Automobiles" are all the same concept in many contexts.
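
For what it's worth, WordNet gives you some of those word relationships out of the box. A quick NLTK sketch, not tied to how either TextTeaser or Stremor actually do it:

    from nltk.corpus import wordnet as wn  # needs: nltk.download('wordnet')

    car = wn.synset('car.n.01')
    truck = wn.synset('truck.n.01')

    # 'automobile' is a lemma of the very same synset as 'car'
    print(car.lemma_names())                   # ['car', 'auto', 'automobile', ...]

    # 'car' and 'truck' meet at a shared hypernym, so at that level of
    # abstraction they can be treated as one concept
    print(car.lowest_common_hypernyms(truck))  # e.g. [Synset('motor_vehicle.n.01')]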


You are spot on about the paper I referenced. As you can see in section 4.3 on page 3, the paper mentions two algorithms for sentence selection: Summation-Based Selection and Density-Based Selection, i.e. SBS and DBS respectively.


What made you pick this representation in particular? I'm kind of curious what different kinds of algorithms you might have looked at.

Summarizing only blog posts seems a bit limiting to me. (Btw, I'm not trying to be negative, congrats on your success! TextTeaser looks great!)

I implemented a custom version of TextRank (mainly changing the scoring scheme to use TF-IDF of the words for the initial scores) and loved it.

The main thing I liked about it was how general it is: sentences are the vertices, and edges are weighted by word overlap. Then you basically use PageRank to rank the sentences according to the graph representation.

[1] http://acl.ldc.upenn.edu/acl2004/emnlp/pdf/Mihalcea.pdf
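
A minimal sketch of that sentence-graph idea in Python, using networkx for the PageRank step; the word-overlap similarity here is a simplification of the measure in the paper:

    import re
    import networkx as nx

    def textrank_summary(sentences, top_n=3):
        """Rank sentences with PageRank over a sentence-similarity graph."""
        tokens = [set(re.findall(r"\w+", s.lower())) for s in sentences]
        graph = nx.Graph()
        graph.add_nodes_from(range(len(sentences)))
        for i in range(len(sentences)):
            for j in range(i + 1, len(sentences)):
                overlap = len(tokens[i] & tokens[j])
                if overlap:
                    # normalise by length so long sentences don't dominate
                    weight = overlap / (len(tokens[i]) + len(tokens[j]))
                    graph.add_edge(i, j, weight=weight)
        ranks = nx.pagerank(graph, weight="weight")
        best = sorted(ranks, key=ranks.get, reverse=True)[:top_n]
        return [sentences[i] for i in sorted(best)]  # keep document order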


Hi, I focus on blog posts because I don't want the scope to be too broad. This is because TextTeaser is my research project for my graduate studies, and a broader scope would be harder to accomplish. But that doesn't mean it can't be used for other types of text. It can still be used; it's just optimized for news.

I'm a little bit familiar with TextRank because I stumbled upon it while doing my research. I also read about several other algorithms but forgot what they're called.


Ahh very cool! Thank you for the insight. I could see where that would be applicable then. Using comments as features is a very neat concept.

News is the most broadly applicable use for this, so leveraging that isn't a bad thing. There's always a trade-off between broad applicability and overfitting to a particular case to get better results.

Thanks for the insight! Again great work.


Any suggestions on papers for Luhn's abstract algorithm? I hadn't heard of it before.



Hey guys, I made a userscript at HackMIT last weekend that adds article summaries to the HN front page. It doesn't use the TextTeaser API (for the time being, at least) but the summaries seem to come out about the same anyway.

Check it out here: https://github.com/lukechampine/ADHN


From a quick test, it seems to treat almost every bit of content on a page equally, even elements which are clearly smaller and next to an image.

Might I recommend taking CSS styles into account? Large text is usually headlines, <strong> text is usually important, and darker greys generally suggest a side comment. Would be much easier if everybody used <aside> and <h1> but even in 2013 that's too high an expectation.


You are right, I'm not taking HTML tags into account. That's because I extract the text beforehand using Python Goose. In that sense, only the text is fed into the algorithm, without any HTML tags.
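
For reference, extraction with Python Goose looks roughly like this (based on goose-extractor's documented usage; the URL is just a placeholder, and this is not the actual TextTeaser pipeline code):

    from goose import Goose  # pip install goose-extractor

    g = Goose()
    article = g.extract(url="http://example.com/some-news-article")

    # only the cleaned text (and the title) go into the summarizer;
    # the HTML tags are already gone at this point
    title = article.title
    text = article.cleaned_text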


Try https://github.com/visualrevenue/reporter :) I'm looking at your service now and it is really massively awesome. Can I ask if you are considering monetizing it, or going the venture path (boo)? I ask because I'm curious about the viability of using your service/library for a long-term project.


He's monetizing it as an API here https://www.mashape.com/mojojolo/textteaser


If you paste this thread into the demo you get not very encouraging results. I haven't looked at the code, but I suspect they find the sentences with the most (cosine) similarity to the title and bias towards early sentences (a sketch of that heuristic follows the results below).

results:

- Hacker Newsnew | threads | comments | ask | jobs | submit hnriot (1618) | logout upvote TextTeaser – An automatic - summarization algorithm (github.com) - If you leave the "Title" field empty and click "Summarize", nothing happens -- which I thought was very confusing. - reply upvote downvote MojoJolo 1 hour ago | link I require the title because I need it for the algorithm. - not a criticism of textteaser (which was behind this excellent project https://news.ycombinator.com/item?id=6498625), - reply upvote downvote wikiburner 3 minutes ago | link Is this a well known text summarization tool?
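
For what it's worth, that guessed-at heuristic would look something like this in Python with scikit-learn; purely an illustration of the speculation above, not what TextTeaser actually does:

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def rank_by_title(title, sentences, position_bias=0.3):
        """Score sentences by TF-IDF cosine similarity to the title,
        plus a linearly decaying bonus for appearing early."""
        vec = TfidfVectorizer().fit([title] + sentences)
        sims = cosine_similarity(vec.transform([title]), vec.transform(sentences))[0]
        positions = 1.0 - np.arange(len(sentences)) / max(len(sentences), 1)
        scores = sims + position_bias * positions
        return [sentences[i] for i in np.argsort(-scores)]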


Hahaha. It sucks. The algo is not meant for this kind of website. Try out news articles! :)


http://www.textteaser.com/s/T4PQ1s

Not sure it captures the essence of the source article's argument. The fourth bullet makes no sense at all. I can't see it being useful at this stage.

Can you provide us with a list of articles that it manages to summarize properly?

It's great to see the project on github though. I look forward to seeing it improved over time. Thanks for sharing.


But it doesn't do well with sentence disambiguation. And the summaries aren't particularly good.

This isn't even on par with Summly, which was pretty hacked together.

https://www.mashape.com/stremor

Creates MUCH better summaries and comes with all the stuff to separate content from the web template.

If you contact Stremor there is also a version that scores every sentence for importance on a scale of 0-100 and maintains HTML so that you can return summaries of any length and still have images and other styling maintained.

( http://www.tldrstuff.com has several ways you can play with the tech )


https://github.com/MojoJolo/textteaser/blob/master/src/main/...

You post non-idiomatic(?) Scala in a comment to explain what you are doing, I think? Not a criticism of textteaser (which was behind this excellent project https://news.ycombinator.com/item?id=6498625), but it seems to raise questions about the language...


That patch of code could be restructured a bit to make it more readable; here are a couple of small suggestions that jump out at me:

1. perhaps `.reduceLeft(_ + _)` could be replaced with a `sum` function or method (assuming one exists in Scala?)

2. if the `topKeywords` collection returned a default value with a `.score` of 0 when queried with a key it doesn't contain, the `headOption getOrElse null` / `match null` dance would not be necessary.

e.g. in python it might look something like this:

    from collections import namedtuple, defaultdict

    # only the 'score' field is shown here; the keyword's other fields are elided
    Keyword = namedtuple('Keyword', ['score'])

    # unknown words default to a keyword with a score of 0
    top_keywords = defaultdict(lambda: Keyword(score=0))

    def sbs(words):
        if words:
            return (1.0 / len(words)) * sum(top_keywords[w].score for w in words)
        else:
            return 0.0
(apologies for making superficial comments about the code. the algorithm itself certainly seems interesting)


You are right. I think Scala is a good language and handles functional programming well. But the code is so abstracted that even I might not immediately get what it is doing. I put the comment there as a reminder for myself, and also so everyone else can easily get what that piece of code is doing.


Curiously, I find it easier to read the Scala code than the commented pseudocode. I've been experiencing this a lot lately. It seems I am losing my ability to reason about code that loops explicitly.

One minor nitpick that can be of help when dealing with tuples: a partial function ({ case xxx => yyy }) is a Function1, so you can use it with map and filter. This way you can deconstruct tuples into names and avoid using _1, _2, etc.: { case (name, value) => blah }

https://github.com/MojoJolo/textteaser/blob/master/src/main/... could be made more readable by giving names to the tuple elements.

Thanks for publishing this code. It yields impressive results.


I've been interested in how it works since I first saw it! Can't wait for the documentation. Though I think I'm going to learn scala just to read through this. Thanks for putting it up!


Is this a well known text summarization tool? I hadn't heard of it before this post.



Yep, pretty well known!

Anyway, thanks for open sourcing - really cool.


In most news articles the first paragraph is already a summary.


Jolo, this is great! What is the implication for your API now? I notice that it's still available on Mashape and you're still charging a fee for it.


Hi! I will still retain the API on Mashape. That is for developers who don't want the hassle of deploying it on their own servers. On the other hand, the open source code is for devs to check out the algo and, hopefully, improve and contribute to TextTeaser. If they want to use it and deploy it on their own, they are free to do so. :)

Think MongoHQ for MongoDB.


Great! You're a good man.


Really wish there was a way I could test the API without giving my CC info to Mashape. Even for the Freemium plan, I can't do a single request without giving payment info. Thus, I'm skipping this API, despite how cool it looks.

Edit: the main TextTeaser web site is down right now, which is why I went straight to the API to test.


Hey, you can contact mojojolo in Mashape through the Contact Now button at the bottom of this page https://www.mashape.com/mojojolo/textteaser#!pricing

He can set up a limited free private API for you to test. Let me know if you have questions about this process - chris@mashape.com


Cool.

What is the structure of the sent.model file inside the corpusEN.bin zip archive?

It's a strikingly small file for something called corpus. Say I have a larger corpus, or a corpus in a different language, how would I go about building one of these sent.model files with more data?


The corpusEN.bin file is the training data provided by OpenNLP which I used to split sentences (http://opennlp.sourceforge.net/models-1.5/). It's not the training data used for summarization.


Can you provide a bit more detail about your approach? Are you using machine learning or just simple scoring based on some heuristics? From the look of the source code it seems to be the latter to me.


It's mostly statistical (simple scoring). But as you can see in these lines of code: https://github.com/MojoJolo/textteaser/blob/master/src/main/... I keep track of the keywords used by the blog and category before. Through that, TextTeaser employs a little bit of machine learning to improve the quality of the results.
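
A rough sketch of what tracking keywords per blog/category could look like; the data structures and the boost formula here are illustrative assumptions, not the linked Scala code:

    from collections import Counter, defaultdict

    # running counts of keywords previously seen, keyed by (blog, category)
    keyword_history = defaultdict(Counter)

    def record_keywords(blog, category, keywords):
        """Update the running keyword counts after summarizing an article."""
        keyword_history[(blog, category)].update(keywords)

    def history_boost(blog, category, word):
        """Small score bonus for words that were keywords in earlier
        articles from the same blog/category."""
        counts = keyword_history[(blog, category)]
        total = sum(counts.values())
        return counts[word] / total if total else 0.0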


Very cool. What made you open source it?

Also, I remember the API being a bit pricier to use. What made you go down on price? I'm tempted to hook this up to my app right now.


There's a certain euphoria I get when I see a different color on GitHub than the normal ruby, python, shell, and JS.


Just checked it on GitHub. It's 100% Scala. :)


Why did you use Scala to build it when Python could have been a good alternative?


I've been seeing good things about Python and NLTK. But back when I developed the core algo of TextTeaser (a few months ago), I didn't know Python yet.

Right now, the TextTeaser website is coded using Python and Flask.


It shouldn't matter what language the OP decided to use, as long as it allows him/her to do whatever he/she set out to do in the first place.


Why use Python when Scala is a good alternative?


Because NLTK is a very strong toolkit for natural language processing, and I haven't found anything comparable in Scala.


NLTK's strength is the clarity and flexibility of its code, for when you're experimenting with various processes and representations to find out what works.

If you have a single NLP model that already works, you wouldn't gain anything from rewriting it using NLTK. It would probably just get slower, because you're adding abstractions that you've already shown you don't need.

I say this as a fan of and (once) contributor to NLTK.


You can see from the source he's already using Apache OpenNLP. Scala is 100% interoperable with Java libs, so you have the entire Java ecosystem available, not just Scala code.


Ya, I saw that. I'm not familiar with OpenNLP. Lemme have a look, it might solve my problem. I'm also starting an NLP project using Scala :)


Why use Python instead of Scala?


How would I go about using this directly from Python, os.system calls?


You can either rewrite it in Python, or use unirest/requests to summarize text via the API.
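
If you go the requests route, the call would look roughly like this; note that the endpoint URL, parameter names, and auth header below are assumptions about how Mashape APIs are typically exposed, so verify them against the Mashape listing before relying on this:

    import requests

    # hypothetical endpoint and parameters -- check the Mashape page
    # (https://www.mashape.com/mojojolo/textteaser) for the real ones
    resp = requests.post(
        "https://textteaser.p.mashape.com/summarize",
        headers={"X-Mashape-Key": "YOUR_MASHAPE_KEY"},
        data={
            "title": "Filtering the news",        # the title is required by the algorithm
            "text": "Full article text goes here...",
        },
    )
    print(resp.json())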


Why would "Philippine" be hard-coded in as a stop word?


I didn't manage to remove it.

I created a news reader for Philippine news (http://www.readborg.com/) using TextTeaser. The word "Philippine" appeared most of the time, so I decided to make it a stop word, and then forgot to remove it from the stop word list.


works good enough for me. I'll give you $15MM for it.

-Marissa


This made me laugh. And $15m is big enough for me.



