Really interesting. I tried it with an article I wrote for Wired - both to see how it handled lengthy content with multiple points, and also how it handled my 'loose' writing style.
Stremor's TLDR:
Pointless reeling off the numbers. The challenge each of us now faces is a brand new one. Our filters were once the media, our friends, and our families. We were aware of, and understood how our filters operated. EdgeRank isn't something that Facebook users understand. Just 20 tweets out of thousands. We need better filters.
Text Teaser:
How do we create a balanced diet of content with so much junk being thrown at us?
Now, large media organisations create mountains of content, then track our reading habits and online behaviour in order to build a profile of us.
A set of favourite news groups; a list of RSS feeds; a well-curated bookmarks folder; these are all filters we once built ourselves.
The less we understand our filters, the more we will come to accept that the world they present us with is true.
The more control we have over our filters, the more we can understand what we're not seeing.
TextTeaser's output goes over 350 characters, which is the established "people can't sue you for stealing it" limit... so also weigh that when deciding which you like better.
Yes, tried it on a couple of my own pages. Seems to work ok.
Who is going to be first to couple this library with an RSS feed reader and mailer so that I can get auto-generated summaries of recently written articles sent to my blackberry?
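Something like this would get most of the way there, I imagine. All of it is hypothetical glue code, and summarize() is just a placeholder for TextTeaser or any other summarizer:

    import feedparser

    def summarize(title, text):
        # placeholder: call TextTeaser (or any summarizer) here
        return text[:280]

    feed = feedparser.parse('http://example.com/feed.rss')
    for entry in feed.entries:
        print(entry.title)
        print(summarize(entry.title, entry.get('summary', '')))

Pipe the output into an email and the BlackBerry part is solved too.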
Personally, I find the "email me news" scheme obnoxious. I get enough emails as it is. I would prefer to see a portal that shows summaries of all the news and lets the user explore.
Once I was at the screen showing me the summary, in the right-hand 'Share' column I couldn't copy the text in the link, image, or embed fields. (FF 4.0 on OS X 10.8.5)
No, I believe he actually meant LSA = latent semantic analysis, which is an algorithm used to extract topics.
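For context, a minimal LSA sketch with scikit-learn; this is purely illustrative, and the toy documents are invented:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD

    docs = [
        "the cat sat on the mat",
        "dogs and cats make good pets",
        "stock markets fell sharply today",
        "investors sold shares as markets dropped",
    ]
    tfidf = TfidfVectorizer(stop_words='english').fit_transform(docs)
    topics = TruncatedSVD(n_components=2).fit_transform(tfidf)  # rows: docs in "topic" space

The low-rank projection is what groups the pet documents and the market documents into separate latent topics.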
I am also curious about how the NLP/ML parts are implemented, as claimed by the README on GitHub. I briefly scanned the code but didn't really spot them.
If you leave the "Title" field empty and click "Summarize", nothing happens -- which I thought was very confusing. You have to fill in something for the Title.
That explains the many tests I just did with random copy-and-pasted articles: I just typed gibberish into the title. I mean, not all texts need to have a title.
Maybe in the future I can improve it so the title isn't required. It may produce good results for other types of text, but right now TextTeaser is meant to be used for news articles.
Since the title (headline) of a news article usually summarizes it, TextTeaser arguably is less an article summarizer than a headline expander...
EDIT: it would be nice, from a UX POV, to request the title if it's missing, rather than silently discarding the story... also, you might emphasize its importance (because it doesn't seem important at all). Perhaps just labeling it as "headline" or "subject" instead of the generic "title" would help.
Looks like the algorithm is giving weight to h1 and h2 tags in the page markup having just tried it on some of my pages. Is that true or am I imagining it?
If so, I'll have to provide more literal subheadings!
This was also the foundation for Summly. The problem is that it is flawed: it doesn't take into account emotion or emphasis.
What is the most important sentence in this:
Drakaal is a poopy head. He often posts to hackernews and calls people an idiot. When this happens I get mad.
The premise is "Drakaal is a poopy head," so that is the most important sentence, but it doesn't inform the user. What he does is the most telling of the sentences, but with "He" as the first word you can't actually make sense of the content without the prior sentence. The last sentence is the least important for understanding the content.
It is important to know that sentence 2 is the most informative, but figuring out what it means requires sentence 1.
When computing the results of a summary you have to weigh sentence dependencies, density of information, amount of emotion expressed and number of characters available to you.
And keywords aren't enough; you need noun entities and the ability to tell the relationships of words, so that you know "Cars, Trucks, and Automobiles" are all the same concept in many contexts.
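To make that concrete, here is a toy sketch of the kind of weighting I mean. The pronoun list, the entity check, and the penalty factor are all invented for illustration; this is nobody's actual code:

    PRONOUNS = {'he', 'she', 'it', 'they', 'this', 'that'}

    def score_sentence(sentence, entities, chars_available):
        words = sentence.lower().replace('.', '').split()
        if not words or len(sentence) > chars_available:
            return 0.0
        # information density: fraction of words that are known noun entities
        density = sum(1 for w in words if w in entities) / float(len(words))
        # sentence dependency: a leading pronoun means it can't stand alone
        penalty = 0.5 if words[0] in PRONOUNS else 1.0
        return density * penalty

On the example above, the middle sentence carries the most information but gets penalized for the leading "He", which is exactly the tension a real summarizer has to resolve.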
You are spot on about the paper I referenced. As you can see in section 4.3 on page 3, the paper mentions two algorithms for sentence selection: Summation-Based Selection (SBS) and Density-Based Selection (DBS).
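For anyone without the paper handy, here is my rough reading of the two schemes sketched in Python. Treat it as an approximation of the idea, not TextTeaser's actual (Scala) code:

    def sbs(words, keyword_scores):
        # Summation-Based Selection: average keyword score over the sentence
        if not words:
            return 0.0
        return sum(keyword_scores.get(w, 0.0) for w in words) / float(len(words))

    def dbs(words, keyword_scores):
        # Density-Based Selection: reward keywords that occur close together
        hits = [(i, keyword_scores[w]) for i, w in enumerate(words) if w in keyword_scores]
        if not hits:
            return 0.0
        k = len(set(w for w in words if w in keyword_scores))  # distinct keywords
        total = sum((s1 * s2) / float((i2 - i1) ** 2)
                    for (i1, s1), (i2, s2) in zip(hits, hits[1:]))
        return total / (k * (k + 1.0))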
What made you pick this representation in particular?
I'm kind of curious what different kinds of algorithms you might have looked at.
Summarizing only blog posts seems a bit limiting to me. (Btw, I'm not trying to be negative; congrats on your success! TextTeaser looks great!)
I implemented a custom version of TextRank (mainly changed the scoring scheme to initialize scores with the TF/IDF of words) and loved it.
The main thing I liked about it was how general it is: sentences are the graph's nodes, and edges between them are weighted by word overlap. Then you basically use PageRank to rank the sentences according to the graph representation.
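Not my original code, but a minimal sketch of the plain version, using networkx for PageRank; the TF/IDF-initialized variant could seed the ranking through the personalization argument:

    import math
    import networkx as nx

    def overlap_similarity(s1, s2):
        # edge weight: word overlap, normalized by sentence lengths
        w1, w2 = set(s1.lower().split()), set(s2.lower().split())
        shared = len(w1 & w2)
        if not shared:
            return 0.0
        return shared / (math.log(len(w1) + 1) + math.log(len(w2) + 1))

    def textrank(sentences, n=3):
        g = nx.Graph()
        g.add_nodes_from(range(len(sentences)))
        for i in range(len(sentences)):
            for j in range(i + 1, len(sentences)):
                w = overlap_similarity(sentences[i], sentences[j])
                if w > 0:
                    g.add_edge(i, j, weight=w)
        ranks = nx.pagerank(g, weight='weight')
        best = sorted(ranks, key=ranks.get, reverse=True)[:n]
        return [sentences[i] for i in sorted(best)]  # keep document order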
Hi, I focus on blog posts because I don't want the scope to be too broad. TextTeaser is my research for my graduate studies, and broader research is harder to accomplish. But that doesn't mean it can't be used on other types of text; it can. It's just optimized for news.
I'm a little bit familiar with TextRank because I stumbled upon it while doing my research. I also read about several other algorithms but forgot what they are called.
Ahh very cool! Thank you for the insight. I could see where that would be applicable then. Using comments as features is a very neat concept.
News is the most broadly applicable use for this, so leveraging that isn't a bad thing. There's always a trade-off between broad applicability and overfitting to a particular case to get better results.
Hey guys, I made a userscript at HackMIT last weekend that adds article summaries to the HN front page. It doesn't use the TextTeaser API (for the time being, at least) but the summaries seem to come out about the same anyway.
From a quick test, it seems to treat almost every bit of content on a page equally, even elements which are clearly smaller and next to an image.
Might I recommend taking CSS styles into account? Large text is usually headlines, <strong> text is usually important, and darker greys generally suggest a side comment. Would be much easier if everybody used <aside> and <h1> but even in 2013 that's too high an expectation.
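A hypothetical sketch of that suggestion with BeautifulSoup; the tag weights are invented for illustration:

    from bs4 import BeautifulSoup

    TAG_WEIGHTS = {'h1': 3.0, 'h2': 2.0, 'strong': 1.5, 'aside': 0.3}

    def weighted_text_blocks(html):
        # yield (weight, text) pairs, boosting text inside "important" tags
        soup = BeautifulSoup(html, 'html.parser')
        for node in soup.find_all(text=True):
            text = node.strip()
            if not text:
                continue
            weight = 1.0
            for parent in node.parents:
                weight *= TAG_WEIGHTS.get(parent.name, 1.0)
            yield weight, text

The weights could then multiply whatever sentence score the summarizer already computes.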
You are right, I'm not taking account of HTML tags. That is because I extract the text beforehand using Python Goose, so only the plain text is fed into the algorithm, without any HTML tags.
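For anyone curious, the extraction step with python-goose looks roughly like this (the URL is a placeholder):

    from goose import Goose

    g = Goose()
    article = g.extract(url='http://example.com/some-article')
    print(article.title)         # what goes into the "title" field
    print(article.cleaned_text)  # plain text, HTML stripped -- what the algorithm sees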
Try https://github.com/visualrevenue/reporter :) I'm looking at your service now and it is really massively awesome. Can I ask if you are considering monetizing it, or going the venture path (boo)? I ask because I'm curious about the viability of using your service/library in a long-term project.
If you paste this thread into the demo you get not very encouraging results. I haven't looked at the code, but I suspect they find the sentences with the most (cosine) similarity to the title and bias towards early sentences -- see the sketch after the results below.
results:
- Hacker Newsnew | threads | comments | ask | jobs | submit hnriot (1618) | logout upvote TextTeaser – An automatic - summarization algorithm (github.com)
- If you leave the "Title" field empty and click "Summarize", nothing happens -- which I thought was very confusing.
- reply upvote downvote MojoJolo 1 hour ago | link I require the title because I need it for the algorithm.
- not a criticism of textteaser (which was behind this excellent project https://news.ycombinator.com/item?id=6498625),
- reply upvote downvote wikiburner 3 minutes ago | link Is this a well known text summarization tool?
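Here is the kind of scoring I am guessing at, sketched in Python; this is pure speculation, not TextTeaser's actual code:

    import math
    from collections import Counter

    def cosine(a, b):
        va, vb = Counter(a.lower().split()), Counter(b.lower().split())
        dot = sum(va[w] * vb[w] for w in va)
        norm = (math.sqrt(sum(c * c for c in va.values()))
                * math.sqrt(sum(c * c for c in vb.values())))
        return dot / norm if norm else 0.0

    def score_sentences(title, sentences):
        n = float(len(sentences))
        # similarity to the title, with a linear bias towards early sentences
        return [cosine(title, s) * (1.0 - 0.5 * i / n)
                for i, s in enumerate(sentences)]

That would explain both the title dependence and why the pasted page chrome near the top scored so well.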
Creates MUCH better summaries and comes with all the machinery to separate content from the web template.
If you contact Stremor there is also a version that scores every sentence for importance on a scale of 0-100 and maintains HTML so that you can return summaries of any length and still have images and other styling maintained.
You post non-idiomatic(?) Scala in a comment to explain what you are doing, I think? Not a criticism of TextTeaser (which was behind this excellent project https://news.ycombinator.com/item?id=6498625), but it seems to raise questions about the language...
That patch of code could be restructured a bit to make it more readable; here are a couple of small suggestions that jump out at me:
1. Perhaps `.reduceLeft(_ + _)` could be replaced with the built-in `sum` method (it does exist in Scala for numeric collections).
2. If the `topKeywords` collection returned a default value with a `.score` of 0 when queried with a key it doesn't contain, the `headOption getOrElse null` match on null would not be necessary.
e.g. in Python it might look something like this:

    from collections import namedtuple, defaultdict

    # only the score field is shown; add whatever other fields you track
    Keyword = namedtuple('Keyword', ['score'])
    top_keywords = defaultdict(lambda: Keyword(score=0))

    def sbs(words):
        if words:
            return (1.0 / len(words)) * sum(top_keywords[w].score for w in words)
        else:
            return 0.0
(apologies for making superficial comments about the code. the algorithm itself certainly seems interesting)
You are right. I think Scala is a good language and handles functional programming well. But the code is so abstract that even I might not get what it is doing. I placed that comment as a reminder for myself, and also so everyone else can easily get what that piece of code is doing.
Curiously, I find it easier to read the Scala code than the commented pseudocode. I've been experiencing this a lot lately. It seems I am losing my ability to reason about code that loops explicitly.
One minor nitpick that can be of help when dealing with tuples: a partial function ({ case xxx => yyy }) is a Function1, so you can use it with map and filter. This way you can deconstruct tuples into names and avoid using _1, _2, etc.: { case (name, value) => blah }
I've been interested in how it works since I first saw it! Can't wait for the documentation. Though I think I'm going to learn Scala just to read through this. Thanks for putting it up!
Hi! I will still retain the API on Mashape. That is for developers who do not want the hassle of deploying it on their own servers. On the other hand, the open source code is for devs to check out the algo and, hopefully, improve and contribute to TextTeaser. If they want to use it and deploy it on their own, they are free to do so. :)
Really wish there was a way I could test the API without giving my CC info to Mashape. Even for the Freemium plan, I can't do a single request without giving payment info. Thus, I'm skipping this API, despite how cool it looks.
Edit: the main TextTeaser web site is down right now, which is why I went straight to the API to test.
What is the structure of the sent.model file inside the corpusEN.bin zip archive?
It's a strikingly small file for something called corpus. Say I have a larger corpus, or a corpus in a different language, how would I go about building one of these sent.model files with more data?
The corpusEN.bin file is the training data provided by OpenNLP which I used to split sentences (http://opennlp.sourceforge.net/models-1.5/). It's not the training data used for summarization.
Can you provide a bit more detail about your approach? Are you using machine learning or just simple scoring based on some heuristics? From the look of the source code it seems to be the latter.
It's mostly statistical (simple scoring). But as you can see in these lines of code: https://github.com/MojoJolo/textteaser/blob/master/src/main/... I keep track of the keywords previously used by each blog and category. Through that, TextTeaser employs a little bit of machine learning to improve the quality of the results.
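My loose reading of that description, sketched in Python (not the actual Scala code at the link above):

    import math
    from collections import defaultdict, Counter

    category_keywords = defaultdict(Counter)  # category -> keyword counts seen so far

    def record(category, keywords):
        # remember which keywords this blog/category produced before
        category_keywords[category].update(keywords)

    def adjusted_score(category, word, base_score):
        # boost words that were historically frequent in this category
        seen = category_keywords[category][word]
        return base_score * (1.0 + math.log(1 + seen))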
NLTK's strength is the clarity and flexibility of its code, for when you're experimenting with various processes and representations to find out what works.
If you have a single NLP model that already works, you wouldn't gain anything from rewriting it using NLTK. It would probably just get slower, because you're adding abstractions that you've already shown you don't need.
I say this as a fan of and (once) contributor to NLTK.
You can see from the source he's already using Apache OpenNLP. Scala is 100% interoperable with Java libs, so you have the entire Java ecosystem available, not just Scala code.
I created a news reader for Philippine news (http://www.readborg.com/) using TextTeaser. The word "Philippine" appeared most of the time, so I decided to make it a stop word. I forgot to remove it from the stop word list afterwards.
Really surprised with both the quality and succinctness of the result: http://www.textteaser.com/s/t1bNud
Well done.
(Also to the project owner - copying the link is borked in Firefox, I had to type it out manually)