Hacker News | barryhunter's comments

There are quite a few Ngram datasets available https://www.google.com/search?q=download+n-gram+dataset

... these are almost certainly used in many spelling and grammar checkers. (They help where a correctly spelled word is used in the wrong context.)

http://www.aclweb.org/anthology/W12-0304


Yes, I remember trying to use the Google Books Ngram Dataset [1], but it was too tedious for me to set up and maintain a server with the data for the purpose of a quick-and-dirty tool (that's why I asked for a ready API). Still, using it is probably a nice idea for a more ambitious side project or even a startup.

EDIT: Actually, I would happily pay for a tool that implements the idea. Grammarly has paid plans, but $30/month is too steep for my kind of usage, and the grammar checks it performs are not exactly what I need (which is what real people in real situations use).

[1] http://storage.googleapis.com/books/ngrams/books/datasetsv2....


We (foxtype) actually have a dev tool that does exactly this.

If we publish it as an online tool do you think people will find it useful?

We have multiple corpora, some language models built in neural networks, etc.


LanguageTool has limited support for using Google's n-gram data to find spelling errors. It only uses 3-grams, and only for a list of commonly confused words. I'm not aware of any Free Software that does better.

http://wiki.languagetool.org/finding-errors-using-n-gram-dat...
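
As a rough illustration of the idea (not LanguageTool's actual code), a confusion-set check over 3-gram counts might look like the sketch below. The counts, the confusion set, the `suggest` function and the `min_ratio` threshold are all made up for the example; a real tool would look counts up in the n-gram dataset.

```python
from collections import Counter

# Toy 3-gram counts standing in for a real n-gram dataset (made-up numbers).
TRIGRAM_COUNTS = Counter({
    ("over", "there", "is"): 900,
    ("over", "their", "is"): 3,
    ("lost", "their", "keys"): 700,
    ("lost", "there", "keys"): 2,
})

# Hypothetical list of commonly confused words.
CONFUSION_SETS = [{"their", "there"}]


def context_score(tokens, i, candidate):
    """Sum the counts of every 3-gram covering position i, with candidate at i."""
    total = 0
    for start in range(i - 2, i + 1):
        if start < 0 or start + 3 > len(tokens):
            continue  # this window would run off the sentence
        gram = list(tokens[start:start + 3])
        gram[i - start] = candidate
        total += TRIGRAM_COUNTS[tuple(gram)]
    return total


def suggest(tokens, i, min_ratio=10.0):
    """Return a much better attested alternative for tokens[i], or None."""
    word = tokens[i]
    for confusion_set in CONFUSION_SETS:
        if word not in confusion_set:
            continue
        original = context_score(tokens, i, word)
        # sorted() keeps the choice deterministic when scores tie
        best = max(sorted(confusion_set - {word}),
                   key=lambda c: context_score(tokens, i, c))
        # Only flag the word if the alternative is far more frequent in context.
        if context_score(tokens, i, best) >= min_ratio * (original + 1):
            return best
    return None
```

With these toy counts, `suggest("the car over their is red".split(), 3)` proposes "there", while "their" in "they lost their keys" is left alone.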


Did you look carefully at the date this was posted on the mailing list? :)


It's up to you if you want to take the risk :)

There is always a risk in trying something new. It might never pan out, or disappear in a puff of smoke.

Nothing would happen if nobody took a little risk.


What does it use as the backend? What classification system is actually used?


Looks like it uses Python on the backend, with nltk. http://www.nltk.org/


Using nltk for tokenizing and stemming text. There is no classification backend, I implemented a Naive Bayes classifier.
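
For anyone curious what a hand-rolled Naive Bayes text classifier can look like, here is a minimal sketch. It is not the code behind the site: the class name, the whitespace tokenizer (used here in place of nltk tokenizing and stemming) and the add-one smoothing are assumptions made to keep the example self-contained.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.doc_counts = Counter()               # label -> number of documents
        self.word_counts = defaultdict(Counter)   # label -> token -> count
        self.vocab = set()

    @staticmethod
    def tokenize(text):
        # Stand-in for nltk tokenizing/stemming: lowercase and split on whitespace.
        return text.lower().split()

    def train(self, text, label):
        self.doc_counts[label] += 1
        for token in self.tokenize(text):
            self.word_counts[label][token] += 1
            self.vocab.add(token)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        tokens = self.tokenize(text)
        best_label, best_score = None, -math.inf
        for label in sorted(self.doc_counts):
            # log prior + sum of smoothed log likelihoods
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for token in tokens:
                score += math.log((self.word_counts[label][token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Usage is just a few `train(text, label)` calls followed by `classify(text)`; working in log space avoids underflow when the texts get long.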


Untick autoShuffle in the controls.


Not available on mobile.


Thanks!


And then the link gets dropped on HN, to get even more traffic?


Slightly self-motivated, yes. That being said, I've spent two years developing applications that could greatly help communities across the world, and I was having a hard time getting those ventures the proper attention. I theorized that something "less impactful" would get more attention and, in the long run, drive attention to more important projects. I've been sharing my Google Analytics on the Facebook page and with friends as a way for all of us to better understand the influence of different platforms. 32% of traffic has come from Facebook, which is pretty much free to all startups. I then ran an ad campaign for five days, which performed miserably. I will be carefully examining traffic sources and how to cheaply launch my app in the coming month. I don't expect a great deal from tiny tiny trophies; I put up a WordPress site in a night and have been watching it over the past week.


Anyone know the criteria for what sites are included?


Seems to be the premise of http://www.majestic12.co.uk/ - looks like it might be failing.


There is a 'Technology' link in the footer.

Really nice implementation btw.


The logger was obviously there; it was deliberately collecting the SSIDs and MAC addresses.

Possibly a debug option to log whole packets was added during development, and it was accidentally left on in production.

Or the whole packet was always logged, a second process would then skim the logs, extracting just the SSID/MAC (correlating with GPS), and a third process would delete the raw logs. That third process failed.

A few big drives in the data collection devices, and possibly nobody noticed they were filling up a little too quickly.

