Not allowing the CFAA to be (ab)used to attempt to make scraping illegal makes s...

akersten · on Sept 10, 2019

Ugh, yeah, the more I think about this ruling, the less I like it.

It's actually pretty insane to force a site to serve content. I think both parties are in the wrong here - HiQ for assuming they're entitled to receive a response from LinkedIn's webservers, and LinkedIn for abusing the CFAA to try to deny service rather than figure out a technical solution to their business problem.

In my view:

* The data is public, and free of copyright. If you're a scraper and can get it, you haven't done anything wrong.

* The servers serving the data are still under LinkedIn's control, and they have no obligation or public duty to always serve that content. They could just as well block you based on your IP or other characteristics. If they want to discriminate and try to only let Google's scrapers access the data - what's wrong with that? Scraper brand is not a protected class. Tough taters if your business model "depends" on your ability to successfully make requests to another uninvolved company's webservers.

If I were the judge, I'd throw this out and let LinkedIn/HiQ duke it out themselves - they deserve each other.

Xelbair · on Sept 10, 2019

I would argue that under spirit of net neutrality you either serve your site to everyone equally(the public facing part) or to no one.

Hosting costs money, servers cost money... but maybe create a public facing API that is way cheaper and easier to use than scraping your website? I see the ruling in a positive light, in that it might promote more open and structured access to public facing data.

dataflow · on Sept 10, 2019

> under spirit of net neutrality you either serve your site to everyone equally(the public facing part) or to no one

Huh? Net neutrality isn't about the server or client... it's about the network operator in between them.

cabalamat · on Sept 10, 2019

I suspect Xelbair is making a more expensive definition of net neutrality, taking as a basis the one that says it's about network operators only.

Darkphibre · on Sept 10, 2019

I think you wanted to say more expansive? But it's definitely also more expensive. :D

cabalamat · on Sept 10, 2019

Yes, I meant expansive. Oops.

Xelbair · on Sept 11, 2019

That was the case, hence the reference to the "spirit" of net neutrality.

Public facing internet sites, in my opinion, should be treated in the same way as a public space - anyone should be free to read, and write down in their notepad whatever is there, in the same way as anyone else.

Scraping a public facing website is, in my opinion, a huge waste of resources. It would be cheaper (in total) to build an API that can serve the data from it than to build a good scraper.

neekburm · on Sept 11, 2019

Net neutrality is more about nondiscrimination in routing content from a provider to a user, rather than forcing content providers to serve everyone regardless of conduct. It's entirely reasonable for a site to discriminate who they wish to allow to access their data (whether technically their copyright or data they caretake).

That being said, if you provide data to the public, you don't get to invoke the CFAA to plug the holes your content discrimination code doesn't fill.

eru · on Sept 10, 2019

Why should you be forced to serve content to people who won't look at your ads?

RobAley · on Sept 10, 2019

Like disabled users with screen-readers?

ehvatum · on Sept 10, 2019

I suppose we can give them a pass if they solve a bunch of captchas.

kijin · on Sept 10, 2019

Anyone is free to put up a paywall and deny access to people who don't pay.

But LinkedIn is apparently happy to let Googlebot and bingbot scrape public profiles. If they want to do that, they can't argue that their policy is to block bots who don't click on ads. Discriminating Googlebot from other visitors is probably a violation of Google policies, too. They can't have their cake and eat it at the same time.

shkkmo · on Sept 10, 2019

From reading the opinion, I think the argument goes something like this:

> First, LinkedIn does not contest hiQ’s evidence that contracts exist between hiQ and some customers, including eBay, Capital One, and GoDaddy

> Second, hiQ will likely be able to establish that LinkedIn knew of hiQ’s scraping activity and products for some time. LinkedIn began sending representatives to hiQ’s Elevate conferences in October 2015

> Third, LinkedIn’s threats to invoke the CFAA and implementation of technical measures selectively to ban hiQ bots could well constitute “intentional acts designed to induce a breach or disruption” of hiQ’s contractual relationships with third parties.

> Fourth, the contractual relationships between hiQ and third parties have been disrupted and “now hang[] in the balance.” Without access to LinkedIn data, hiQ will likely be unable to deliver its services to its existing customers as promised.

> Last, hiQ is harmed by the disruption to its existing contracts and interference with its pending contracts. Without the revenue from sale of its products, hiQ will likely go out of business.

> LinkedIn does not specifically challenge hiQ’s ability to make out any of these elements of a tortious interference claim. Instead, LinkedIn maintains that it has a “legitimate business purpose” defense to any such claim. ... That contention is an affirmative justification defense for which LinkedIn bears the burden of proof.

So the real situation is that you can't go out and start blocking access you knew about in a way that would interfere with third party contracts without a legitimate business reason to do so. The burden of proving the legitimacy of that business reason is on you.

edit: TLDR;

> "A party may not ... under the guise of competition ... induce the breach of a competitor’s contract in order to secure an economic advantage."

tomp · on Sept 10, 2019

That’s quite ... crazy.

Be restaurant. Be on Deliveroo. Be getting low margins because of high fees.

So basically you can’t decide not to use Deliveroo any more, to improve margins (“secure an economic advantage”). I mean, you can cancel Deliveroo, but only as long as you’re not “inducing a breach of their contract”. So it's only a matter of time before Deliveroo writes a contract “we’re obligated to deliver food for you from said restaurant”.

XCabbage · on Sept 10, 2019

Choosing not to use a middleman any more so that you can secure higher margins sounds like about the clearest example of a "legitimate business reason" imaginable. The purpose of the act is to immediately increase your margins, not to hurt Deliveroo because you don't want their competition.

That's very different from the case in question, where LinkedIn's motive for cutting off hiQ's access is to inflict damage on hiQ because they are a potential competitor.

olau · on Sept 10, 2019

I would imagine that if you contract with Deliveroo, they have some terms that say that you need to give notice when cancelling?

I don't know Deliveroo, but I think a better analogy would be if you suddenly, even though it is not causing you trouble, denied access to someone picking up food that you didn't contract with, with the full knowledge that the someone would be in big trouble with their customers.

davvolun · on Sept 10, 2019

IANAL, but I think you're misunderstanding "without a legitimate business reason to do so"

"Be Restaurant" blocking Deliveroo because they can't continue operating with the loss of revenue due to high fees is a legitimate business reason. "Be Restaurant" blocking Deliveroo 2: Electric Boogaloo because I don't like their owner, but continuing to allow Deliveroo access would be, presumably, disallowed.

Also there's nothing stopping "Be Restaurant" from offering an exclusive delivery contract to Deliveroo and forcing Deliveroo 2 out, or requiring a minimum fee for all delivery services, Deliveroo and Deliveroo 2 included.

Of course, I think this is all in a very different area from a restaurant; we're talking about a service provided on the internet. I believe LinkedIn has many, many other recourses here, but, as I see it, the courts are just telling them, this aint it chief.

rocqua · on Sept 10, 2019

So, if you want to block someone from your service, you need to be able to prove that it is for a legitimate business purpose.

Moreover it seems, 'this harms a competitor of ours' is not considered a legitimate business purpose, but anti-competitive behavior.

Thorrez · on Sept 10, 2019

Why does there need to be a legitimate business purpose? What about freedom of speech? It's my website and I'll publish what I want to.

olau · on Sept 10, 2019

Eh, I think you got this backwards. If you really want to talk about this in terms of freedom of speech, LinkedIn is in the act of censoring?

Edit: What I mean is that freedom of speech is not the same as freedom of censoring.

XCabbage · on Sept 10, 2019

> What I mean is that freedom of speech is not the same as freedom of censoring.

This is at least not quite true of First Amendment law. The concept of "compelled speech" exists in US law, and is considered an unconstitutional violation of the First Amendment. Exactly what falls into that category (and whether the right of domain owners to censor user-provided content as they see fit is protected), I'm not sure, but freedom of speech in the US certainly does at least sometimes include the right not to speak.

Thorrez · on Sept 10, 2019

Yes, the court was right to block LinkedIn's abuse of the CFAA. But the court was wrong to say that LinkedIn must show hiQ the same website as LinkedIn shows everyone else.

hgoel · on Sept 10, 2019

The ruling doesn't seem to say that they can't throttle access, at least.

greatpatton · on Sept 10, 2019

The data are certainly not free of copyright. The data can contain a user's picture, or even a small essay describing the job or life of a user, though LinkedIn is not the copyright holder. Moreover these are personal data, and I'm not sure that the scraper has the original user's right to collect the data. In Europe, the scraper may face issues related to GDPR.

johnny99 · on Sept 10, 2019

Facts can't be copyrighted, so such things as whether or not a person worked for a certain company, or went to a certain school, are unprotected, and with this ruling can be scraped, at least in the U.S. Other things common on LinkedIn, as you rightly point out, are protected, but by copyright law, not the CFAA. So a scraper acting in good faith would have to be careful about what they used if they wanted to respect copyright, but it's a separate issue from this ruling.

http://www.dmlp.org/legal-guide/works-not-covered-copyright

torstenvl · on Sept 10, 2019

This is exactly right. Copyright protects creative expression, not pure fact. Famously, phone books (remember those?) are basically not copyrightable except for the ads, because they're just lists of data. Feist Publications, Inc., v. Rural Telephone Service Co., 499 U.S. 340 (1991).

greatpatton · on Sept 10, 2019

I never said that facts can be copyrighted, I said that most of the things people put in their profile can be. I was responding to the claim made above that the data were not under copyright. If you just scrape name, company, and position, this is fine, but I highly doubt that they just do that. This lawsuit can have tons of side effects.

fauigerzigerk · on Sept 10, 2019

I think what hiQ does is to predict whether a particular employee is about to quit.

So the interesting question to me is whether you can lawfully make predictions based on published information if that information is under copyright.

In Europe the answer is probably no, because the assumption is that in order to analyse data you have to copy it first.

To me, this interpretation of the term "copying" makes very little sense. So I wonder what US law makes of it.

anticensor · on Sept 10, 2019

Europe has database rights, which have a fair dealing exemption for data analysis.

fauigerzigerk · on Sept 10, 2019

I'm not sure what "database rights" refers to specifically, but the whole matter is actually rather complicated, because the EU copyright directive has a lot of optional exceptions that member states may or may not adopt.

Most of these exceptions only apply to non-commercial use though. So they wouldn't apply in a case like hiQ.

UK specific exceptions are explained here:

https://www.gov.uk/guidance/exceptions-to-copyright

Unfortunately, both Labour and the Tories have taken a relatively hard line in the EU copyright negotiations, so it seems unlikely that things will be relaxed very much after Brexit.

anticensor · on Sept 10, 2019

Database rights are a copyright-like intellectual property regime for databases.

perl4ever · on Sept 10, 2019

"Facts can't be copyrighted, so such things as whether or not a person worked for a certain company, or went to a certain school, are unprotected"

There's an infinite number of ways to describe a job history, without any single standard, so I don't think it makes any sense to say that a profile or resume is not copyrightable.

jecxjo · on Sept 10, 2019

Isn't the issue one of being selective about who can view the content? If I, random Joe User, view the publicly available content, you have no issue. But if someone scrapes that data, then you'd want to charge them. Unless I click on the ad, the cost of using your bandwidth doesn't change based on who the viewer is. You'd want to apply fees based on the future use of the data rather than on your actual costs.

jsgo · on Sept 10, 2019

I'd assume if you weren't signing up, you'd probably look at like 10 profiles tops. A scraper is more than likely going to run through anything and everything it can grab links to (provided it doesn't leverage a very specific filtering mechanism for selecting profiles to scrape).

I could see the hit from a scraper being heavier than that of a typical user. There's also the potential that a user is going to click an ad for any number of reasons, there isn't that likelihood the scraper will.

I'm not anti-scraping by any means, but I get the concerns.

pbhjpbhj · on Sept 10, 2019

Presumably you'd be allowed to limit a scraper to a standard user bandwidth, and a standard user access - X links per day, Y bandwidth.
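
To make the "X links per day" idea concrete, here is a minimal sketch of a per-client daily quota. It is only an illustration under stated assumptions: the `DailyQuota` class, the `client_id` key, and the limit are hypothetical, not anything LinkedIn or the ruling describes, and a real deployment would also need to decide how to identify a client in the first place.

```python
from collections import defaultdict
from datetime import date


class DailyQuota:
    """Hypothetical per-client cap: at most max_requests per calendar day."""

    def __init__(self, max_requests):
        self.max_requests = max_requests
        # Maps (client_id, day) to the number of requests already served.
        self._counts = defaultdict(int)

    def allow(self, client_id, today=None):
        """Return True and count the request if the client is under quota."""
        day = today or date.today()
        key = (client_id, day)
        if self._counts[key] >= self.max_requests:
            return False
        self._counts[key] += 1
        return True
```

The same cap applies to every client, human or bot, which is roughly the "standard user access" being suggested: a scraper is limited, but not singled out.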

pbhjpbhj · on Sept 10, 2019

Surely the action is "if you display stuff in public you can't segment the public".

You're not obliged to have public access.

Is there perhaps a factor here of users having an expectation that their profile is publicly accessible; so companies hosting that profile shouldn't be able to choose _secretly_ "who" can access it?

IfOnlyYouKnew · on Sept 10, 2019

You're inconsistent, and so are the courts and most comments here. Either you favour such conflicts to be decided by technological might, or by the clearly expressed will of the content publisher to have binding effect.

If you consider scrapers to have some sort of right to access any public website, any technological barriers inflict exactly the same harm as an injunction, assuming it is effective. If you allow technical blocking, it would be preferable to allow blocking-by-clearly-stated-wish, because it would save everyone the costs of the arms race. It would also make both parties' success somewhat independent of the resources they can invest into outgunning their opponents.

kabacha · on Sept 10, 2019

> However, how is it reasonable to force a web site to serve its contents to a third-party company, without being allowed to make a decision whether to serve it or not?

Your statement makes absolutely no sense. That's not how the internet works. If you serve something publicly you don't get to cherry pick who sees it.

Not only does it make no sense technically, it's also a huge anti-competitive case.

sezna · on Sept 10, 2019

It makes sense and it is how the internet works. Servers cherry pick who sees their content all the time. Scrapers are often blocked, as are entire IP address ranges. Things like Selenium server scrapers can be (approximately) detected and often are denied access.

I’m not sure about being anti-competitive. Serving a website is an action in which you open up your resources for others to access. My friend runs an open source stock market tracking website for free. He started getting hit with scrapers from big hedge funds and fintech companies a couple of months back. This costs him around $50-100 a month to serve all of these scrapers.

ericd · on Sept 10, 2019

If he gives them a stable, fast API with a subscription fee, and the scrapers are truly from hedge funds, he’s going to make a lot more than $100/mo.

ddingus · on Sept 10, 2019

He should open up a Patreon, tip jar, something to get that funded.

Could also delay results, offer reduced temporal precision and other things to differentiate use cases.

sezna · on Sept 10, 2019

He and I both have similar free open source websites with donate buttons. They are rarely clicked. Ad revenue over a month for me has been ~$400 while donations over two years have totaled $20. There are about 80,000 unique visitors per month.

It is nice to think donation platforms can fund high traffic open source projects, but this is simply not the case.

In any regard, I fear the potential of this ruling limiting developers’ ability to protect their servers and making us all roll over to the big players with their hefty scrapers taking all of our data for resale.

bryanrasmussen · on Sept 10, 2019

How long are you allowed to delay results? I mean, not serving results is just delaying them forever, but that's out. Can I delay serving results longer than Chromium's default timeout?

rocqua · on Sept 10, 2019

Probably up to the point where a judge says 'this is blocking not delaying'.

spoondan · on Sept 10, 2019

I don’t see what legal or technical argument you’re making.

Technically, of course you can identify IP ranges owned by certain entities and restrict their access. That’s trivial, so what do you mean when you say the internet doesn’t work like that?

Legally, there’s plenty of region locked content for copyright and censorship reasons. A distributor might region lock because they don’t have distribution rights in particular regions. Are you saying distributors can’t publish free content at all because they can’t choose who sees it but would be breaking copyright law to publish to everyone? Or a site might region lock because certain content is censored in particular countries. Can you not publish anti-regime articles because a totalitarian country is on the Internet?

The entire world isn’t and shouldn’t be held hostage to the most restrictive laws that exist in the world. And the answer isn’t blocking on the requesting end because that’s technically much harder and blocks much, much more content. So what am I missing?

Edit: Forgot to include the other end of the spectrum. If I, as an individual, host my own site on my own hardware with my own connection that I pay the bandwidth for, can I deny a suspected bot network?

_ps6d · on Sept 10, 2019

Of course you get to choose. You can reject requests based on their user agent, their IP address, the owner or likely geographic location of the IP address, and many other possibilities.
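
As a rough sketch of what that kind of filtering looks like, using only the Python standard library: the blocked network below is a reserved documentation range and `examplebot` is a made-up agent name, so both are placeholders rather than real scraper identifiers.

```python
import ipaddress

# Placeholder values for illustration only: 203.0.113.0/24 is a reserved
# documentation range, and "examplebot" is a made-up agent name.
BLOCKED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]
BLOCKED_AGENT_SUBSTRINGS = ["examplebot"]


def should_reject(remote_ip, user_agent):
    """Return True if the request matches a blocked network or agent string."""
    addr = ipaddress.ip_address(remote_ip)
    if any(addr in net for net in BLOCKED_NETWORKS):
        return True
    agent = user_agent.lower()
    return any(name in agent for name in BLOCKED_AGENT_SUBSTRINGS)
```

Note that the user agent is whatever the client chooses to send, so only the IP check is based on something the server observes directly.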

kabacha · on Sept 10, 2019

What are these possibilities? You only get the IP and whatever client-side information the client is _willingly_ sending to you. So if a script/user/bot/etc. says it's Firefox from 1.2.3.4, then all you know is that it's a request from 1.2.3.4 that says it's Firefox. You can ask it to run JavaScript code, but that's beyond classic web interaction, and then again you need to trust the client.

This interaction cannot be made trustless, so every client can only be served based on their IP or some convoluted, hacky exchange that is a cat-and-mouse game at best.

austincheney · on Sept 10, 2019

LinkedIn’s public facing content is exactly that: public. This ruling merely says accessing public content isn’t hacking and so LinkedIn cannot use the CFAA as a discriminatory weapon to limit access to that public facing content.

If LinkedIn wants to block access they need to do so by another means that isn’t described as hacking.

bryanrasmussen · on Sept 10, 2019

Does their robots.txt say don't crawl this part of the site? If it does, this ruling is catastrophic. If it doesn't then there is hope.

shkkmo · on Sept 10, 2019

> If it does, this ruling is catastrophic

I know it is generally considered bad form to ask, but did you read much of the ruling? I feel like a lot of people on this thread are just going off of Animats' comment and haven't spent much time looking at the opinion.

I didn't read the whole thing, but skimmed through it and read what seemed to be the relevant parts of the argument. (Including the bit that talks about LinkedIn's robots.txt)

The ruling doesn't really support your claim of catastrophe and doesn't claim to pass any sort of final judgement.

The judge makes a specific point about not reading too much into him upholding the injunction saying:

>> These appeals generally provide “little guidance” because “of the limited scope of our review of the law” and “because the fully developed factual record may be materially different from that initially before the district court.”

johnny99 · on Sept 10, 2019

Compliance with robots.txt is and always has been voluntary, and many crawlers have long ignored it, including Archive.org.
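
To illustrate why it is voluntary: robots.txt is just a text file that a well-behaved crawler chooses to consult before fetching. A minimal sketch with Python's standard `urllib.robotparser` follows; the rules, the `ExampleBot` name, and the example.com URLs are made up for illustration and are not LinkedIn's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration; not any real site's robots.txt.
rules = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A polite crawler asks before fetching; nothing stops a rude one from skipping this.
allowed_public = parser.can_fetch("ExampleBot", "https://example.com/profile/1")
allowed_private = parser.can_fetch("ExampleBot", "https://example.com/private/1")
```

The check runs entirely on the crawler's side, which is why a crawler that never performs it faces no technical barrier at all.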

paulie_a · on Sept 10, 2019

Do any scrapers actually pay attention to robots.txt?

decoyworker · on Sept 10, 2019

It's not forcing anything. Don't make a page public then? If a page is public then it is fair game.

test6554 · on Sept 10, 2019

This second part is pretty stupid. However, now that we are at this point, LinkedIn still has the ability to decide which of its information is public and which is not. By making all of its information private, it can take back control.

dangerface · on Sept 10, 2019

I think the title is wrong and they are holding that LinkedIn cannot block hiQ specifically from viewing public data.

Which seems fair: it's public or it's not. You can't pick and choose who it's public for and who is a second class citizen.

DoctorOetker · on Sept 10, 2019

it's not the scraper's fault that their business model incorrectly assumed profitability through ads in a way that did not foresee compliance with future anti-scraper-discrimination laws.

it's a good point you bring up, and may contribute to the death of ads.