PACER is definitely interesting, a bit antiquated, and to date, the data has mostly resided in the hands of the big information companies (Lexis, Westlaw, etc.).
I've been building a system/website to access, search and develop intelligent analytics from PACER court information. We're tracking cases, attorneys, parties, judges, as well as the actual case dockets. The data is a treasure trove of information, and if anyone's interested, I'd be very happy to chat more about it.
The site (a signup for now as I'm working out the kinks in the system) is www.docketleads.com. Email me there or ping me here for more info.
I worked on a similar project a decade ago written mostly in Perl with the frontend in PHP (hey, it was 2004, folks!). Just checked and I still even have the old courtbot.com domain I registered for the project.
I suspect you'll find pretty quickly that there's a limit to how far regular expressions or similar techniques can take you if you want to normalize and reference precedents and make sense of cases. That's why Lexis and Westlaw pay actual attorneys considerable sums to summarize cases, and why they can still command such princely subscription fees even in 2014. But analytics might be interesting. A family member is a judge, and her judicial office keeps track of how many cases she decides per month, how many reversals she receives, etc. I don't know if those are made public -- certainly I'm not aware of any project to do it across a large data set, and I wish you luck with it.
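To make the regex point concrete, here is a minimal sketch in Python. The pattern, the tiny reporter list, and the `find_citations` helper are all my own illustrative assumptions (not anything Lexis, Westlaw, or the poster's system actually uses), and the "Smith v. Jones" citation in the sample is made up. It picks up full-form citations but has no way to resolve "Id." or "supra" back to the case they refer to, which is roughly where humans tend to come in.

```python
import re

# Hypothetical pattern: volume, reporter abbreviation, first page.
# The reporter list is a tiny illustrative subset, not a complete one.
CITATION_RE = re.compile(
    r"\b(?P<volume>\d{1,4})\s+"
    r"(?P<reporter>U\.S\.|S\. ?Ct\.|F\.(?:2d|3d)?|F\. ?Supp\.(?: ?2d)?)\s+"
    r"(?P<page>\d{1,5})\b"
)


def find_citations(text):
    """Return (volume, reporter, page) tuples for full-form citations."""
    return [
        (int(m.group("volume")), m.group("reporter"), int(m.group("page")))
        for m in CITATION_RE.finditer(text)
    ]


# The second citation below is invented purely for illustration.
sample = (
    "See Roe v. Wade, 410 U.S. 113 (1973); accord Smith v. Jones, "
    "123 F.3d 456 (9th Cir. 1997). Id. at 460. But see Roe, supra."
)

# Only the two full-form citations are found; the short-form
# references ("Id. at 460", "Roe, supra") are silently skipped.
print(find_citations(sample))
```

Getting from tuples like these to "which earlier case does this sentence rely on, and for what" needs context that a pattern match does not carry.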
You're definitely right about case/precedent information, but what we've found is that there's a whole other world of info that can be neatly organized with a lot of crunching, and a small bit of manual manipulation.
The big guys chasing this are focused almost entirely on lawyers, in the context of providing them case analysis tools. We've found a bit of a different niche which doesn't need as much fidelity/granularity in the information, but needs it nonetheless.
In any case, I'd love to chat about your experience, even if a decade old. Can I PM you?
Sure, happy to chat! What you're doing seems interesting, especially if you're not targeting the lawyer/case research market. My email address is in my HN profile. Though I am working nonstop on http://recent.io/ right now. :)
I wonder if Recap [1] would help in addressing the censorship/deletion issue. Ultimately, the way we fund these programs is the root of the problem (and the privatization of what is supposed to be public data).
RECAP hasn't been nearly as active as its initiators had hoped. The data there is pretty good however, and for a handful of key cases, I would say it's very good. The biggest issue with the data is that it's spotty. Since it relies on individuals to pull info on each case, some cases may only have partial information (not all the parties, attorneys, etc. represented), or not have the full docket available (and rarely, if ever, all of the documents associated with a case).
Recap only helps here if everyone accesses/pays for all of the files that are about to be deleted, and doing so would surely cost millions or billions.
The judiciary sees it as a profit center.
Folks have offered to essentially buy the data and make it entirely public.
But they see too much profit from it.
Of course nobody is "profiting." There are no shareholders getting dividends or execs getting bonuses. They use it to fund the operations of the judiciary in the face of a shortage of funding from Congress.
Except, of course, that PACER has various requirements that conflict with this, and they make compliance as hard as possible in order to keep this profit (which was 150 million in 2008) up.
For example, written opinions that "set forth a reasoned explanation for a court's decision" must be free of charge.
They make it as difficult as possible to access this, and do not allow any sort of bulk download, because doing so would make PACER/courtweb less useful as a pay service.
It would cost Google negligible money to host this data and the only people who would be upset would be the rent-seeking jerks responsible for the current PACER debacle.
Un-fucking-believable. PACER has always been awful (I've used it since about 2005), but this is a new low---this is ACTIVE awfulness.
I assume, based on the weird specificity of what they're removing, that the PACER office is doing this at the request of the individual courts. Which just sort of underscores how awful this is---that courts get to decide how public their own opinions are.
Not so. The AO forced the courts to do it according to two people at the Second Circuit.
The most likely explanation is that as part of the "upgrade" of CM/ECF (the write component of PACER) they needed to jettison old databases that used a different schema. This is of course nonsense. They've likely spent over $100 million on this upgrade since 2007, though actual numbers are surprisingly hard to come by. For that price they could have probably afforded a few coders to convert the older databases over.
Yep. Though Congress could liberate all of PACER, retrospective and prospective, if it chose -- one data dump to Carl Malamud would do it. The appropriations bills are wending their way through the legislative process right now (mostly out of committee), and that might be a vehicle to add a one-line amendment. Would require a lot of work in the next month or two.
I'd point you in the direction of Carl Malamud, Jim Harper at the Cato Institute, and EFF, probably in that order. Jim's made it a project to liberate government data; Carl's gone further and made it his life's work.
Inside Congress itself? Hmm. I'm spending my time working on http://recent.io/ and not paying close attention nowadays. But if you're local to the SF south bay try Rep. Lofgren? I've done some Q&As with her and found she's one of the smarter and better-informed members of Congress on tech policy issues.