Since deep learning is dominating the news these days, this might be a good opportunity to point out potential new ways of approaching ad blocking. One is to use character-based convolutional (or recurrent) networks that read the characters of a URL and classify it as legitimate or an ad.
The interesting gain here is that there are no more conflicting regexes and no more optimization work over huge lists of blocking rules. Instead, URLs can be passed to the network in batches.
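To make this concrete, a character-level classifier along these lines could look roughly like the PyTorch sketch below; the alphabet, length cap and layer sizes are illustrative assumptions, not the published model:

    # Illustrative character-level CNN for URL classification (ad / not-ad).
    # Alphabet, length cap and layer sizes are assumptions for this sketch.
    import string
    import torch
    import torch.nn as nn

    ALPHABET = string.ascii_lowercase + string.digits + string.punctuation  # 68 chars, close to the 69 used in practice
    MAX_LEN = 256

    def encode(url: str) -> torch.Tensor:
        """One-hot encode a URL into a (len(ALPHABET), MAX_LEN) tensor."""
        x = torch.zeros(len(ALPHABET), MAX_LEN)
        for i, ch in enumerate(url.lower()[:MAX_LEN]):
            j = ALPHABET.find(ch)
            if j >= 0:
                x[j, i] = 1.0
        return x

    class UrlCNN(nn.Module):
        def __init__(self, n_chars: int = len(ALPHABET)):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv1d(n_chars, 128, kernel_size=7), nn.ReLU(), nn.MaxPool1d(3),
                nn.Conv1d(128, 128, kernel_size=3), nn.ReLU(), nn.MaxPool1d(3),
            )
            self.classifier = nn.Linear(128, 2)  # logits for [legit, ad]

        def forward(self, x):              # x: (batch, n_chars, MAX_LEN)
            h = self.features(x)           # (batch, 128, reduced_length)
            h = h.max(dim=2).values        # global max pool over positions
            return self.classifier(h)

    # Batched scoring: one forward pass over many URLs instead of per-rule regex scans.
    urls = ["http://ads.example.com/banner?id=42", "https://news.ycombinator.com/item?id=1"]
    batch = torch.stack([encode(u) for u in urls])
    with torch.no_grad():
        p_ad = UrlCNN()(batch).softmax(dim=1)[:, 1]  # P(ad) per URL (untrained weights here)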
I have conducted my own set of experiments and a usable deep model is available (see http://www.deepdetect.com/applications/text_model/ ). This is kind of a shameless plug, but I'm really interested in feedback on these new ways of doing high-accuracy ad blocking.
For sure, running a deep net service is not as easy as installing uBlock, but there are ways. The whole source code is open source, as is the model. I have more data at hand, and larger models could be built. A performance assessment would be a good next step as well.
EDIT: the linked page is rather long; search for 'Novel task' to get to a quick classification example.
Part of the requirement for ad-blocking software is to inject itself into the request pipeline in real time and reject ad requests while causing as small a stall as possible and using the least amount of resources. NNs running inside a browser don't satisfy any of those criteria.
I am still unconvinced that NNs provide any advantage over regular expressions in this domain, since it's not a hard problem to solve. Also, the request needs to be rejected before the connection is made, so the only data to work with is the HTTP request headers.
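For comparison, the status quo being defended here is roughly the sketch below; the patterns are made-up illustrations rather than real filter-list rules, and the timeit call gives a rough per-URL matching cost to weigh against a model's forward pass:

    # Rough illustration of regex-based blocking: a few made-up patterns compiled
    # into one alternation and checked synchronously before the request proceeds.
    import re
    import timeit

    BLOCK_PATTERNS = [
        r"^https?://([^/]+\.)?ads?\.",   # ad-ish subdomains (illustrative)
        r"/banners?/",                   # banner paths
        r"doubleclick\.net",             # a well-known ad host
    ]
    BLOCK_RE = re.compile("|".join(BLOCK_PATTERNS))

    def should_block(url: str) -> bool:
        return BLOCK_RE.search(url) is not None

    url = "https://ads.example.com/banners/728x90.gif"
    print(should_block(url))                                          # True
    print(timeit.timeit(lambda: should_block(url), number=100_000))   # seconds for 100k checks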
At the enterprise level, a better challenge to solve with NNs would be ASIC-accelerated neural net hardware to filter packets/connections for IPS/IDS and firewall purposes, with the extra ability of possibly also blocking ads.
A character-based URL classifier would be very, very easy to game.
Perhaps you meant an image-based classifier, though in this case Pharoah2's objection about speed is an insurmountable issue even on an enterprise server, since classifying a website on a CPU would take many seconds; on a GPU it would probably be several hundred milliseconds, which is a few hundred milliseconds too many!
Performance is in ms, see http://caffe.berkeleyvision.org/performance_hardware.html
Look up the testing numbers: 500 images/sec.
Character-based models are way simpler; you can test one very easily using the original link to the model.
They are not easily gamed either: they learn features as convolutional filters over the characters. These filters are more powerful than n-grams, and can be as complex as the filters you may have seen for images.
Now, about gaming the nets, there is an issue, and it is the same as for image-based CNNs: some combinations of letters (or pixels) do exist that make no difference to the eye but push the classifier into the wrong class. However, in order to find these combinations, first-hand access to the underlying neural net and its weights is mandatory.
The performance numbers are not representative of your proposed workload: 1) the images come either from RAM or sequentially from disk, 2) they are batched and tested together with vector operations all the way (which is also why memory bandwidth dominates the latency), and 3) the images are only 256 x 256 pixels (much smaller than a website).
Also, the neural net occupies something like 1.2 gigs of RAM, and something like 3 gigs of GPU memory.
The character-based net reads the URL, not the website. The alphabet used in the model made public has 69 characters, and the URL length can be, for example, 256.
Not only is this far below the image size you are talking about, but after the first convolution the alphabet dimension (in one-hot encoding) is collapsed, leaving 1-D convolutions across the 256 positions.
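A quick shape check of this claim (the filter count and kernel size below are arbitrary):

    # A one-hot URL batch is tiny next to even small images, and the 69-wide
    # alphabet axis disappears after the first convolution, leaving 1-D convolutions.
    import torch
    import torch.nn as nn

    ALPHABET_SIZE, URL_LEN = 69, 256
    url_batch = torch.zeros(32, ALPHABET_SIZE, URL_LEN)   # 32 one-hot encoded URLs
    img_batch = torch.zeros(32, 3, 256, 256)              # 32 small RGB images

    print(url_batch.numel(), "vs", img_batch.numel())     # 565248 vs 6291456 values

    first_conv = nn.Conv1d(in_channels=ALPHABET_SIZE, out_channels=256, kernel_size=7)
    print(first_conv(url_batch).shape)                    # torch.Size([32, 256, 250]) -- 1-D from here on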
Not that terrible. In proxy-like territory, passing batches of URLs to the GPU could pave the way for new ad blockers at large. To be tested :)
I was referring to the hypothetical image classifier, not the URL text classifier (the URL text classifier, as I mentioned, would perform very poorly).
Would crowdsourcing help? Like, what if, when you block an element on a page, it asks whether what you're blocking is an ad (or something else "no one" wants to see)? If you click yes, it sends that URL or whatever to the neural net servers. If enough people block that element, it'll get blocked.
Possibly. Having millions of URLs to populate each class is a good starting point. Our current dataset, gathered through other means, has around 10M URLs in the 'ads' category. The model we made available to the public was built from 2M of these URLs.
EDIT: of possible interest, these models output a probability (and possibly a confidence) that a URL should be blocked. Based on these, a blocker could ask for confirmation.
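As a sketch of what acting on that output could look like (the thresholds and the ask-the-user middle band are assumptions, not something the published model prescribes):

    # Illustrative decision rule on top of a classifier that returns P(ad) for a URL.
    from enum import Enum

    class Action(Enum):
        ALLOW = "allow"
        ASK = "ask"      # prompt the user; the answer doubles as a crowd-sourced label
        BLOCK = "block"

    def decide(p_ad: float, block_above: float = 0.9, allow_below: float = 0.3) -> Action:
        if p_ad >= block_above:
            return Action.BLOCK
        if p_ad <= allow_below:
            return Action.ALLOW
        return Action.ASK    # uncertain band: ask for confirmation

    print(decide(0.97))  # Action.BLOCK
    print(decide(0.55))  # Action.ASK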
There are already commercial products that use NN and other ML techniques to classify and block malware, click fraud etc. - Cisco has one. It's certainly a valid and viable approach, even in the face of determined opposition.
Implementation aside, I agree with your first sentence. It could also create a real incentive to try gaming NNs, which might help to further their capabilities.
Apart from efficiency, the most important difference between these two is that Adblock has a default whitelist which allows certain types of ads to pass (by charging advertisers like Google and Taboola millions of dollars to unblock their ads).
While ABP's business model is ethically questionable, it is a win-win for all involved parties:
1) end users get spared from the worst bunch of ads (layer ads, huge screen-filling ads in the left/right margins between content and screen, auto-playing video ads, popup/popunder ads, ads that mess with your browsing history, pre-roll ads on video sites)
2) site owners don't have their revenue stream completely fucked with more and more people using ad blockers
3) advertisers, provided they play by the rules, still have a way to get their content out to users
4) ABP/Eyeo has financial resources for development, hosting, maintenance and is able to keep up in the whack-a-mole game with the nasty parts of the ad distribution networks
It sounds nice in theory, but ABP charges an outrageous 30% and whitelists anyone who pays, regardless of the quality or performance of their ads.
You can look at their whitelist and find all sorts of intrusive and shady ad networks in there. This kind of funding model is bound to incentivize ABP into taking the wrong actions and working with bad actors.
There was some kind of schism between the original creator and the person who was meant to take over the project, and the result is two extensions, with the original changing its name.
Agreed, Firefox for Android isn't quite as slick as Chrome (yet), but not having to deal with obnoxious full-page ads with tiny tap targets is a massive win.
I use the original AdBlock for Chrome (not ABP). One of its features that I find indispensable is the ability to sync custom filters, filter lists and settings via Dropbox as I use Chrome on at least 3 different machines. I'd be willing to switch to uBlock Origin if it had a sync feature.
Why is blocking via a browser extension common but blocking with a proxy not? I use a proxy[1], which means I get ad blocking in my mail client, RSS reader, etc. Is a proxy too much work?
[1] I use glimmerblocker. It's OK; the biggest "problem" is that I had to do a lot of tuning -- I think it's just one developer. And I had to write some code to get it to support HTTPS traffic (basically, my proxy on my own machine had to perform a MITM "attack" on my behalf).
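For what it's worth, the same MITM-style setup can be scripted with a general-purpose tool like mitmproxy; the addon below is only a rough sketch, with a made-up blocklist, and assumes a recent mitmproxy where http.Response.make is available:

    # blocker.py -- minimal mitmproxy addon sketch; run with: mitmdump -s blocker.py
    # mitmproxy terminates TLS locally (the MITM "attack" described above), so HTTPS
    # requests can be inspected and answered locally instead of reaching the ad server.
    from mitmproxy import http

    class Blocker:
        BLOCKED_HOSTS = {"doubleclick.net", "ads.example.com"}  # made-up blocklist

        def request(self, flow: http.HTTPFlow) -> None:
            host = flow.request.pretty_host
            if any(host == h or host.endswith("." + h) for h in self.BLOCKED_HOSTS):
                # Reply locally with a stub response instead of forwarding the request.
                flow.response = http.Response.make(
                    403, b"blocked by proxy", {"Content-Type": "text/plain"}
                )

    addons = [Blocker()]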
Because there is more friction in setting up an ad-blocking proxy; secondly, your usage patterns differ from the norm, where the majority of people are almost completely browser-dependent.
I've started having this problem as well. YouTube videos with ads return an error for about a minute, then start. I've started just disabling it on YouTube.
CSS files often come from different hosts than the original site, usually CDNs or other static content hosts. Are you sure you enabled access to those hosts?
What you describe is exactly what I'd expect to see, and what I've seen with other filtering extensions, if the static hosts weren't enabled.
Their FAQ fails to answer the basic questions - who runs it, how is it funded and why should it be trusted?
Your DNS server can log the domains you request and can serve malicious replies (particularly bad for HTTP and other non-authenticated protocols). Trusting some random server is not the smartest decision.
It solves a lot of efficiency problems, but is hell for debugging a broken webpage. For that reason I feel the browser extension justifies its footprint.