
PACER's antiquated interface and dirty data have fueled a wildly profitable legal services industry.

The public deserves more than the baseline of free access to court records. Give the public access to clean, normalized case data like that in LexisNexis's and Westlaw's outrageously priced products, including a full database dump for civic hackers to improve.

Aside: I worked on a product which relied on scraping PACER. A foolish integration test on our CI server once racked up a $50,000 PACER bill which we didn't notice until the end of the quarter.



The concept of “civic hacking” is an error of projection. Because user-generated data is valuable in other areas of tech, techies assume it's valuable in civics. Unlike with Internet technology, user-generated data has little value in politics and the law. Knowing that 76% of district courts did something one way has little persuasive value compared to knowing two or three well-reasoned court of appeals opinions. The data is in fact out there: all federal cases are available for free on Google Scholar and various other sites. And nobody has done anything interesting with that information.

That is not to say that politics and law can’t be data driven. But the data we need in those areas is not data you can derive by analyzing user or government generated databases. We need “hard data” such as how different tax policies impact economic growth. That’s not the sort of data analysis “civic hacking” can give us.


The users are generating the data already: getting arrested and sued. This data is not yet a URL you can share. Sign up for PACER and see how hard it is to associate a credit card and look up one case. It is not accessible to the casual user.

Perhaps we have witnessed different types of citizen contribution. I am not suggesting data science. Many would happily join a volunteer endeavor to clean up our country's case database and build a great UI on top of it which encourages public interest in the judicial system.

> The data is in fact out there

It is not. There is no complete, real-time, or normalized open case data out there, and LexisNexis and Westlaw's products are not priced to be accessible to individual citizens.


> Many would happily join a volunteer endeavor to clean up our country's case database and build a great UI on top of it which encourages public interest in the judicial system.

What would be the value of that? What problem is that trying to solve that is worth giving up $140 million a year in revenue that’s currently coming almost entirely out of the pockets of lawyers and law firms?

> The data is in fact out there

> It is not. There is no complete, real-time, or normalized open case data out there, and LexisNexis and Westlaw's products are not priced to be accessible to individual citizens.

I’m not sure what you mean by “normalized” in this context, but Lexis and Westlaw do a ton of additional analysis on top of public cases at their own expense. That information isn’t in PACER. To generate it you’re talking about even more expense. And while the existing databases are not updated in real time, they’re quite complete for federal cases, which is all that PACER covers. And I’ve never seen an interesting application of that data beyond simply allowing you to read the cases.


>coming almost entirely out of the pockets of lawyers and law firms?

You mean their clients' pockets? Thus making legal assistance more accessible for the wealthy than the poor.

And just because you can't think of a good use of the data doesn't mean there isn't one. I can think of dozens of analyses I would love to run on a large set of cases and filings. You could develop models to predict how likely the language used in a filing is to contribute to a favorable ruling. You could measure the speed with which filings are handled and compare it across jurisdictions. Picking a specific type of case and one where there's a jurisdictional split implicating administrability, you could develop a rough proxy measure for the relative costs of the alternative rules. Etc etc. And that's just the limited subset of what I randomly dreamt up in three seconds that I'm willing to take the time to type out on my phone.

Data analysis like that is possible and even relatively easy with a large, standardized, freely available data set (which BTW is what the person you are replying to meant by normalized). Twitter has been analyzed to death using NLP for that very reason.
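As a rough illustration of the jurisdiction comparison mentioned above, here is a minimal sketch. It assumes a standardized data set already exists; the record layout, the district names, and every date are made up for illustration and do not come from PACER.

```python
from collections import defaultdict
from datetime import date
from statistics import median

# Hypothetical records: (district, date filed, date ruled on).
# All values are invented; a real data set would supply these fields.
FILINGS = [
    ("District A", date(2019, 1, 2), date(2019, 1, 12)),
    ("District A", date(2019, 2, 1), date(2019, 2, 21)),
    ("District B", date(2019, 1, 5), date(2019, 3, 6)),
    ("District B", date(2019, 2, 10), date(2019, 3, 12)),
]


def median_days_by_district(filings):
    """Return the median number of days from filing to ruling per district."""
    days = defaultdict(list)
    for district, filed, ruled in filings:
        days[district].append((ruled - filed).days)
    return {district: median(values) for district, values in days.items()}


if __name__ == "__main__":
    for district, value in sorted(median_days_by_district(FILINGS).items()):
        print(f"{district}: median {value} days from filing to ruling")
```

The hard part is not this loop but getting clean filing and ruling dates in the first place, which is the point about standardization.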


> You mean their clients' pockets? Thus making legal assistance more accessible for the wealthy than the poor.

The price of legal services is based on supply and demand, not individual lawyers’ cost structures. Moreover, the rules already allow free PACER access for the poor. I’d bet the large bulk of PACER fees are actually coming from lawyers representing corporations and well-off individuals.


> What would be the value of that? What problem is that trying to solve that is worth giving up $140 million a year in revenue that’s currently coming almost entirely out of the pockets of lawyers and law firms?

You have it exactly backwards here. It is not stopping the flow of money that needs justifying, but the flow of money existing in the first place. There is no right to a business model; such a thing can only end in the abject insanity of demanding that it rain so you can sell your umbrellas.

>I’m not sure what you mean by “normalized” in this context

[Database normalization](https://en.wikipedia.org/wiki/Database_normalization) means ensuring that the data stays consistent across many sources and avoids redundancy. It addresses issues like "There are two postings, one for an Allan Smith in New York Court ABC Room A at 3pm on January 2nd, 1993 and one for Alan Smith in New York Court ABC Room A at 3pm on January 2nd, 1993 - which is correct?"
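A minimal sketch of how the Allan/Alan Smith case could be flagged for review, assuming the postings were available as structured records. The `Posting` fields, the `likely_same` helper, and the 0.9 threshold are all hypothetical choices for illustration. The sketch only flags a probable duplicate; it cannot say which spelling is correct.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Posting:
    """Hypothetical structured view of a court posting."""
    party: str
    court: str
    room: str
    when: str


def likely_same(a: Posting, b: Posting, threshold: float = 0.9) -> bool:
    """Flag two postings as probable duplicates.

    They must share court, room, and time exactly, and the party names
    must be nearly identical. The threshold is an arbitrary assumption.
    """
    if (a.court, a.room, a.when) != (b.court, b.room, b.when):
        return False
    ratio = SequenceMatcher(None, a.party.lower(), b.party.lower()).ratio()
    return ratio >= threshold


p1 = Posting("Allan Smith", "New York Court ABC", "Room A", "1993-01-02 15:00")
p2 = Posting("Alan Smith", "New York Court ABC", "Room A", "1993-01-02 15:00")
p3 = Posting("Bob Jones", "New York Court ABC", "Room A", "1993-01-02 15:00")

if __name__ == "__main__":
    print(likely_same(p1, p2))  # the two Smith postings get flagged
    print(likely_same(p1, p3))  # an unrelated party does not
```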

>but Lexis and Westlaw do a ton of additional analysis on top of public cases at their own expense. That information isn’t in PACER.

Then they have their own valid business model to sell their derivative works to supplement the public data that anyone else can get for free.


> You have it exactly backwards here. It is not stopping the flow of money that needs justifying, but the flow of money existing in the first place. There is no right to a business model; such a thing can only end in the abject insanity of demanding that it rain so you can sell your umbrellas.

No, you’re demanding the government give umbrellas away for free because people have a right to be protected from the rain. The government isn’t stopping you from distributing documents you got on PACER. It’s charging you for access to PACER itself.

> Issues like "There are two postings, one for an Allan Smith in New York Court ABC Room A at 3pm on January 2nd, 1993 and one for Alan Smith in New York Court ABC Room A at 3pm on January 2nd, 1993 - which is correct?"

PACER just stores free-form PDF files. There is some metadata, but the substance of your example above would be in the PDF.

> Then they have their own valid business model to sell their derivative works to supplement the public data that anyone else can get for free.

That’s what they do. The original court opinions are free on PACER.


The opinions are free. The rest of the filings, which you need if you want information about a case which has not yet been decided, or if you want a fuller picture of what's going on in any case, are not.

Personally, I have no idea whether it's possible to do meaningful data analysis on PACER data, so I'm not necessarily arguing against your original point. But as a non-lawyer who sometimes gets interested in particular court cases, I find PACER fees obnoxious. Partly because the fees force me to think about whether I actually need a particular piece of information, which is so alien to the normal way I browse anything on the Web. And partly because the fees are the reason I have to actually deal with PACER's 90's-style interface: if all PACER data were free, I'm sure there would be at least one free website mirroring that data with a nicer interface.


I don't mind advocating against regulatory capture in general. If nothing else, improved access to this data will make the Westlaw/LexisNexis industry more competitive.

There are a lot of ways governments generate revenue even if we want to restrict it to the legal industry. Obfuscating public information to the point that there's an oligopoly controlling access is fucked up.


You’re deeply confused about the situation if you think Westlaw and Lexis have anything to do with PACER. You use those services to access legal opinions. The underlying opinions are free on PACER and also usually posted on courts’ websites. Westlaw and LEXIS don’t index most of what’s in PACER. Conversely, PACER doesn’t contain most of what’s in Westlaw and LEXIS. PACER for the most part only has electronic files going back to the early 2000s. You can’t use it for serious legal research. Westlaw and Lexis have comprehensive databases of opinions going back to the 18th century because they went and scanned in all those paper records. (And West has been around since 1870 and has been building its collection ever since then.) The government (in fact, governments: there are hundreds of mostly autonomous court systems in the country) isn’t limiting access. It simply doesn’t have the data that makes West and Lexis so valuable.


Are you sure? I have not used Westlaw or Lexis, but from a quick search, both of them appear to offer services – albeit distinct from their 'main' services – that provide access to court dockets and corresponding documents [1] [2], which would account for the rest of what's in PACER.

[1] https://www.lexisnexis.com/en-us/products/courtlink-for-corp...

[2] https://legal.thomsonreuters.com/en/products/westlaw/dockets


You’re right, I forgot West/Lexis have docket tracking. I assume that gets data from PACER, but it also gets data from court “runners” (people who go to courts to get filings).

That being said, the overlap between West/LEXIS and PACER is very small. The docket tracking is an adjunct service that you use in the unusual situation where you’re keeping tabs on a case you’re not participating in. It’s not real time (day after), which is a big thing because the PACER/ECF notification is the official notice that triggers time periods and deadlines. It’s also vastly more expensive than PACER (like $50 per document).

That doesn’t address OP’s point, which is West/Lexis’s market dominance. And that is based on those systems having 200+ years of human annotated and indexed case law. PACER doesn’t have that data.


Exactly. Sounds like maybe courts should have higher filing fees.


What exactly is the “regulatory capture” here?


The higher than needed fees locking out competitors. Even if they didn't actually 'capture' them it is still regulation supporting their business model without a justifying good. Seat belts being compulsory may make more money for seat belt manufacturers but it isn't regulatory capture because it serves an actual purpose. Requiring all seatbelts be sourced domestically to shut out competitors would be regulatory capture however.


Are you seriously suggesting that PACER fees have any impact on competition? That’s like saying toll roads protect the business model of taxi companies. It’s completely nonsensical.


I am saying that taxi medallions have an impact on competition; that is a more appropriate analogy given the magnitude. The fees boost the start-up expense of competing considerably, as opposed to just processing and serving the data at a discounted rate as a 'mirror'. Assuming about 30 KiB per page, at PACER's $0.10 per page a mere gigabyte comes out to about $3,500, which is vastly more than the roughly $0.08 it costs to store 1 GB.
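For anyone who wants to check that arithmetic, a minimal sketch follows. It assumes a flat $0.10 per-page fee with no per-document cap or fee waiver, takes the 30 KiB page size from the estimate above, and reads "gigabyte" as 1 GiB (2^30 bytes). The function name is made up for illustration.

```python
PAGE_SIZE_BYTES = 30 * 1024   # 30 KiB per page, the estimate used above
FEE_PER_PAGE_USD = 0.10       # assumption: flat per-page fee, no cap or waiver
GIB = 1024 ** 3               # assumption: "gigabyte" read as 1 GiB


def pacer_cost_usd(total_bytes, page_size=PAGE_SIZE_BYTES, fee=FEE_PER_PAGE_USD):
    """Estimate the fee for downloading total_bytes worth of pages."""
    pages = total_bytes / page_size
    return pages * fee


if __name__ == "__main__":
    cost = pacer_cost_usd(GIB)
    print(f"~{GIB / PAGE_SIZE_BYTES:,.0f} pages, ~${cost:,.0f} in fees")
```

Under those assumptions the result lands just under $3,500. Reading "gigabyte" as 10^9 bytes instead gives a somewhat lower figure, but the same order of magnitude.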

It is a well-known phenomenon: consider how quixotic the Article 13 upload filter is, because the expense locks out new competitors.


The concept of precedent in itself disagrees with you there - it isn't just a techie thing, because a fundamental principle of law is consistency in application. The value may not be financial per se (nobody gets rich directly out of ensuring equal application of the law) but it has a major impact upon quality of life. While it may be used as an excuse to avoid thinking, there is value in precedent itself: it makes the law predictable and more certain in its bounds.

Gathering the data of trials and judgments can prove when something isn't working, and prove how things are actually working in the field, better than even the best rhetorician. The question of 'should we be doing something' is separate from data.

But being able to get the cases together lets you point out that 'minimum/maximum sentences are constraining judges and juries, since in 70% of the cases they note that they would go lower/higher but are constrained from doing so'. It can point out real problems with inequality in execution or corruption. Civic hacking is essentially meant for accountability: to show how the system is or isn't working. Which is part of why resistance to it is so worrying.


That data is already out there. Almost the entire corpus of federal cases is available in data dumps online. I’ve never seen anything interesting done with it.

Don’t confuse my opposition to techno-optimism with opposition to accountability. The issue is not that we shouldn’t have accountability. It’s that there is already a ton of data out there, and civic hackers have had approximately zero impact with it. Because for the most part what you actually need is carefully controlled studies from real institutions, not “civic hacking.” Civic hacking is the techie version of “raising awareness.”


It's not just techies who think this is valuable. Yale Law's Information Society Project uses it.



