Full text search on 400M US court cases

mrits · on Nov 19, 2020

It's pretty hilarious and somewhat frightening: I found my dad's arrest from 25 years ago for a speeding ticket he had "forgotten" to pay. I remember being 11 years old and having to wait 8 hours for my parents to come back from picking up a pizza. Data availability is crazy.

hbt · on Nov 19, 2020

the frightening part is although your father's record is available to the public, police officers who are caught lying while testifying get to seal the record.

in many jurisdictions, sealing a record is the equivalent of destroying it.

your crimes will haunt you forever because the system never forgets, meanwhile they simply go back to business like it never happened

ref https://www.google.com/amp/s/www.nytimes.com/2018/03/18/nyre...

unishark · on Nov 20, 2020

I think judges can seal records for anyone. A friend of mine had it done after his conviction several years ago. Sure enough I can't find him in the database. He still notifies potential employers about it though.

notyourwork · on Nov 20, 2020

He’s legally required to.

handol · on Nov 19, 2020

Found an assault charge on my Mom from '92.

richardbarosky · on Nov 19, 2020

Do you think the data should be removed from the government portals? Those are interesting points. What do you think is the right balance to strike?

I can see why it might be surprising to find some results when searching. The same data has already been available in many other databases that have existed long before this one and in those described on the info page as well.

handol · on Nov 19, 2020

It's on the internet forever now. If there's a balance to strike it would have had to have been done in 2007 when the court digitized their records and put them online.

A search for "minor consuming" reveals a few hundred thousand cases against children. I'm a little surprised to see that.

codenesium · on Nov 19, 2020

Minor in this context is often 18 to 20 years old.

lostlogin · on Nov 20, 2020

And drinking at that age is legal in many places.

I’d imagine there are a fair few people on that list whose crimes are for things that are now legal.

Waterluvian · on Nov 20, 2020

Arrest for a speeding ticket? Good God.

Justin_K · on Nov 20, 2020

The arrest was for failing to appear in court.

cocoa19 · on Nov 20, 2020

In other countries, if you are speeding, you get a ticket by mail. If you are driving under the influence, you are sent to jail for the night.

Why do taxpayers have to pay for expensive court proceedings, and offenders have to spend a lot of money on an attorney and waste a bunch of time?

filoleg · on Nov 20, 2020

>Why do taxpayers have to pay for expensive court proceedings, and offenders have to spend a lot of money on an attorney and waste a bunch of time?

They don't have to. The usual process for something like a speeding violation in the US is:

1. You get stopped for speeding (or get caught on a speedcam).

2. You receive the ticket in the mail.

3. At this point, you have an option to agree with it and pay the fine OR appear in court and hope they will rule in your favor (which could easily happen if you genuinely believe they were wrong; and half the time, the cop himself will fail to appear in court anyway, so you get the ticket dismissed if it wasn't anything too wild).

You don't have to appear in court (if you choose to accept the ticket and pay the fine) or waste money on attorneys (if you choose to contest the ticket in court). You can literally play it the same way in the US as you just described, by getting the ticket in your mailbox and paying it off (for something like speeding). That's it. Contesting the ticket in court is just another option available to you.

As for what happened in the case the parent commenter mentioned (failing to appear in court): they basically didn't pay the ticket they received (i.e., ignored it) and didn't show up in court to contest it either. That's pretty much it.

Waterluvian · on Nov 20, 2020

Ah yeah suppose that can escalate.

impalallama · on Nov 20, 2020

Well I don’t even know what’s going on here. https://www.judyrecords.com/record/hxv7x79e609

chris_wot · on Nov 20, 2020

"In Nix v. Hedden, 149 U.S. 304, 37 L. Ed. 745, 13 S. Ct. 881, the question presented was whether tomatoes were to be classed as fruit or vegetables under the tariff act. The court found no particular help from the witnesses called, and decided the point through the use of a dictionary, which was not evidence, but an aid to memory and understanding."

https://www.judyrecords.com/record/5sfv13kuy1081

sushshshsh · on Nov 20, 2020

Ahh yes, the John J. Bitch from Slutville who lives on Whore Street. I know him well. But I don't remember him having pink eyes or being 335 pounds.

mrits · on Nov 20, 2020

apparently the search box repopulates from a cookie. Last thing I searched was my name. I clicked your link and freaked out for a second.

bgroins · on Nov 20, 2020

I'm going to go out on a limb here and say this is probably a test record for whatever software they use.

octoberfranklin · on Nov 20, 2020

Compelled prostitution of a person under 17, duh. I mean, obviously. By a person named "bitch" in "whoresville, usa". It was all authorized by "PIMPDADDY", as shown there, plain as day.

pluto9 · on Nov 20, 2020

"John Julio Bitch" there is quite the baller. Released the very same day for a $25 million cash bond. He's also apparently an albino who stands 5'1" tall and weighs 335 pounds.

lostlogin · on Nov 20, 2020

That is amazing.

There are lots that are amusing, with most names or insults getting a hit. It seems the ‘AKA’ field has all sorts of entries.

godmode2019 · on Nov 19, 2020

I feel this is bad news. Some things should be forgotten. In my country your record gets soft wiped after 8 years. With this the employer could just look up your name.

mushbino · on Nov 20, 2020

I found my conviction for assault from a bar fight 23 years ago. Since then I quit drinking, went to college, raised a child who is now 20, and turned everything around. It's pretty disheartening to see it can be found by anyone 23 years later. Unfortunately, I'm not surprised in the least that we allow this in the US.

Breza · on Nov 24, 2020

Have you tried to get your record purged?

mushbino · on Nov 24, 2020

I'm looking into that right now again. It's a pretty tedious process where the Governor has to personally approve the request. Right now there is a Republican governor in that state so they're less likely to approve it. Since background checks can only go back 7 years, 10 in some states, I let it go and decided it wasn't worth it considering that. I thought it was behind me, but this definitely changes things. Thanks for mentioning it.

richardbarosky · on Nov 19, 2020

The general idea has various problems. For example, would newspapers or accounts of things/people, in various media, of objectively public information be required to be retroactively removed from any mention? Does it make sense to force and dictate what entities/individuals can do with basic information at the discretion of anyone who doesn't like it? Just a few thoughts. The records exist in the database because they are public information. If a record is removed from public view, that's done when requested because it's the right thing to do, although there is no strict legal obligation to do so.

colejohnson66 · on Nov 19, 2020

How does Europe’s “right to be forgotten” handle it?

richardbarosky · on Nov 19, 2020

Not sure. Maybe if someone doesn't like the Google search results, they make a complaint, and Google has to do what they want.

There are many other public records databases that have similar data, including the federal judiciary and many state courts across the country.

Some are listed on the info page: https://www.judyrecords.com/info

120bits · on Nov 19, 2020

From the Reddit link [1]

Sorry, I'm just curious.

It says MySQL 8 and Elasticsearch 7.8. I don't have much experience with Elasticsearch, and I wanted to know how Elasticsearch makes it faster. Is it like an extension that makes it faster? Or does Elasticsearch have its own data store that consumes data from the database and magically makes it faster?

Thanks.

[1]https://www.reddit.com/r/programming/comments/jg4rkv/how_a_s...

rpedela · on Nov 19, 2020

Elasticsearch, Lucene under the hood, implements an inverted index which is an extremely fast data structure for text search. ES has clustering as a primary feature too and many search features that can significantly improve relevance that you won't find in MySQL and most other databases.
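To make the inverted index idea concrete, here is a toy sketch in Python. It is not how Lucene is implemented (Lucene adds analyzers, compressed postings, scoring, and much more); the function names and sample documents are made up for illustration. The point is only that a lookup goes from term to document IDs directly instead of scanning every row.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a real analyzer does far more)."""
    return text.lower().split()


def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search(index, query):
    """Return IDs of documents containing every query term (AND search)."""
    postings = [set(index.get(term, ())) for term in tokenize(query)]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


# Hypothetical sample documents, not real records.
docs = {
    1: "speeding ticket failure to appear",
    2: "assault charge dismissed",
    3: "speeding ticket paid",
}
index = build_inverted_index(docs)
print(search(index, "speeding ticket"))  # documents 1 and 3 contain both terms
```

The index is built once at write time, which is why searches stay fast as the number of documents grows.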

devy · on Nov 19, 2020

Have you tried Toshi[1] or MeiliSearch[2]? I wonder how it would compare in terms of operational costs (monthly cloud hosting bill) at the current data size.

[1]: https://news.ycombinator.com/item?id=18895655

[2]: https://news.ycombinator.com/item?id=22685831

jabo · on Nov 19, 2020

Do you have the structured dataset somewhere? I’d love to index it in Typesense [1] and see how it does.

I recently tried a 32M songs dataset [2] and it works great, so I’m on the lookout for larger datasets to benchmark with.

[1] https://news.ycombinator.com/item?id=22181437

[2] https://songs-search.typesense.org/

lolive · on Nov 19, 2020

Plus it does not accept joins. So you basically have to denormalize all your data before injecting into Elastic. It helps speed things up, but it is a headache to manage on a day-to-day basis.

rpedela · on Nov 19, 2020

Yeah. What I do is create a view that does all the joins then the middleware just needs to do "SELECT * FROM my_view". If the DB has good JSON support, I will also convert the data into an ES index request with SQL so the middleware becomes even simpler.
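A rough sketch of that pattern, using SQLite from Python so it runs anywhere. The table, view, and index names are invented for the example, and here the bulk body is assembled in the middleware rather than in SQL. The output follows the newline-delimited format of the Elasticsearch bulk API (an action line followed by the document source).

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE courts (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE cases  (id INTEGER PRIMARY KEY, court_id INTEGER, title TEXT);
    INSERT INTO courts VALUES (1, 'Example County Court');
    INSERT INTO cases  VALUES (10, 1, 'State v. Doe'), (11, 1, 'Roe v. Acme');
    -- The view does the join once, so the indexer never has to.
    CREATE VIEW case_docs AS
        SELECT cases.id AS id, cases.title AS title, courts.name AS court
        FROM cases JOIN courts ON courts.id = cases.court_id;
""")
conn.row_factory = sqlite3.Row


def bulk_lines(rows, index_name):
    """Yield action/source line pairs in the bulk API's NDJSON layout."""
    for row in rows:
        doc = dict(row)
        doc_id = doc.pop("id")
        yield json.dumps({"index": {"_index": index_name, "_id": doc_id}})
        yield json.dumps(doc)


rows = conn.execute("SELECT * FROM case_docs ORDER BY id")
body = "\n".join(bulk_lines(rows, "cases")) + "\n"  # bulk bodies end with a newline
print(body)
```

In a real setup the body would be sent to the cluster's `_bulk` endpoint in batches rather than printed.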

lolive · on Nov 20, 2020

Let’s say you have a mostly read-only DB (otherwise things are different).

Does it work when the view is insanely big? [i am not an expert at DBs, so my vision of a big DB might amuse you, but let’s say I have millions of rows to assemble as a view].

arminiusreturns · on Nov 19, 2020

How would you say it stands up to splunk these days?

richardbarosky · on Nov 19, 2020

Elasticsearch is a search platform. A "database" but meant for search stuff. It's not part of MySQL.

nerdponx · on Nov 19, 2020

https://news.ycombinator.com/item?id=25152925

scottydelta · on Nov 20, 2020

The gist is Elasticsearch is a full-index database. Whatever data goes in gets indexed as compared to only indexing certain fields in MySQL on which you perform search frequently. Think of Elasticsearch as MongoDB + full-indexing. It's a document storage with blazing fast search and aggregation.

_gtly · on Nov 19, 2020

https://www.courtlistener.com has more useful features and is part of the Free Law Project.

richardbarosky · on Nov 19, 2020

I've noted CourtListener on the info page: https://www.judyrecords.com/info

"PACER notwithstanding, CourtListener is the most powerful case law research tool available online — and in many ways is much more powerful."

This is based on CourtListener's 4 million+ written court opinions, which judyrecords has recently integrated. But you're right, CourtListener has more case law research features.

HoverSausage · on Nov 19, 2020

I just managed to find the home address of a YouTuber I'm a fan of in 15 seconds. Creepy site. Glad I'm not in the US

richardbarosky · on Nov 19, 2020

Interesting point. If you know the state where someone lives, you can look up the same info on the government website. Additionally, many other databases have the same public data, but they ask for a payment to search.

Farfromthehood · on Nov 20, 2020

Whoa. Searched my name and found my sealed court records from when I was barely a teen (25 years ago). The records even state "sealed/exempt from public" at the top.

This is pretty neat as I have never seen the records, even though I requested them (out of curiosity) 10 years ago, only to be told they had been destroyed years prior.

richardbarosky · on Nov 20, 2020

Is the case available on the court portal? You can email me richardbarosky@gmail.com if it's been removed since it was retrieved.

lostlogin · on Nov 20, 2020

Can you explain why the poster would want to email you?

Edit: I think I got there eventually - this is something you made?

l3s2d · on Nov 19, 2020

Does anyone know if the race data is used for crime statistics? I did a quick sample of people I know, and almost every South Asian was miscategorized as Black or White.

richardbarosky · on Nov 19, 2020

The race data in court case records is very often used for crime statistics. It's probably the most analyzed data point after what the incident was about.

dkn775 · on Nov 21, 2020

No, it’s likely not. Arrest records are separate from traffic citations, and they are two different databases. Also, your race may come from the cop filling out the ticket, or it may come from your license in more advanced jurisdictions. The source for crime statistics is usually not court records; those are held by the courts.

jaybna · on Nov 19, 2020

Wish we would have known about this years ago. One search would have prevented the hire of someone who ended up costing us a ton of money. Most background searches don’t get local or state court cases like this without major expense that small businesses can’t afford.

richardbarosky · on Nov 19, 2020

Many similar databases and people finder sites are behind a paywall. There are a lot of positives to having public records data available for making more informed decisions, whether it's to let your kid stay at someone's house you don't know or whatever it might be. Thanks.

r3trohack3r · on Nov 19, 2020

Interested in how large this dataset is?

Is it in a format that could be backed up by a community to protect? Seems like something folks in /r/datahoarder would be interested in backing up.

richardbarosky · on Nov 19, 2020

15KB is maybe the average case size, including HTTP request data.

That's 1024 * 15 * 439,000,000 = 6.7TB roughly.
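For anyone who wants to check that estimate, a quick sketch of the arithmetic, using the 15KB average and the 439 million count from above (both rough figures):

```python
AVG_CASE_BYTES = 15 * 1024   # 15 KB per case, the rough average quoted above
CASE_COUNT = 439_000_000     # approximate number of cases

total_bytes = AVG_CASE_BYTES * CASE_COUNT
print(f"{total_bytes:,} bytes")
print(f"{total_bytes / 10**12:.2f} TB (decimal, 10^12 bytes)")
print(f"{total_bytes / 2**40:.2f} TiB (binary, 2^40 bytes)")
```

So "6.7TB roughly" holds in decimal units; in binary units it is a bit over 6.1 TiB.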

The cases are all compressed, so I'm not using 6.7TB non-compressed for cases. But there are other request and non-request related records needed too. Just my backups currently.

loxias · on Nov 19, 2020

Being as you're offering use of the site for free, would you be open to the idea of also offering publicly available DB dumps? There's plenty of fun projects that I can imagine doing if I had that data locally.

nerdponx · on Nov 19, 2020

One Reddit user estimated the monthly cost of this site at over $2000 USD. How are you funding that?

https://www.reddit.com/r/programming/comments/jg4rkv/comment...

richardbarosky · on Nov 19, 2020

I've downgraded from that. I talked about that in that post. It was most definitely a knee-jerk reaction to getting slashdotted on a popular subreddit and not wanting that to happen again. However, still on some very good hardware and handling current workload pretty well right now. That estimate was high.

ethbr0 · on Nov 19, 2020

Bullet points on what you downgraded to cut costs? Curious technical minds want to know.

richardbarosky · on Nov 19, 2020

Sure, I'll post after the dust settles. Server getting smashed but still handling searches pretty dang well.

Some sites crash from the page views, and here I have to handle everyone searching 400 million documents too.

vlmutolo · on Nov 19, 2020

Odds are this won’t help you, but just in case you haven’t seen it.

https://blog.burntsushi.net/transducers/

LunaSea · on Nov 19, 2020

Hmmm: https://www.judyrecords.com/record/1ikmhvbrhfa3a

csunbird · on Nov 19, 2020

333 N Warcraft Lane Undercity, Washington 99999

Looks like a place I would like to live in.

richardbarosky · on Nov 19, 2020

Ah, good one. There are many nuggets in there. "holy shit", fart, etc.

LunaSea · on Nov 19, 2020

Any idea how those came into the system?

The one I quoted seems to be some kind of test case?

richardbarosky · on Nov 19, 2020

Most likely, just like the asdf occurrences.

bpeebles · on Nov 19, 2020

I'm not sure why I didn't expect them to be in this database, but this also has like traffic tickets and similar.

onetimemanytime · on Nov 19, 2020

I understand the open court argument, we need to see what goes on so nothing funny happens there. But unless we're talking about a major crime, what good does it do to list and index on Google everything from 30 years ago?

I am no fan of this at all.

WindyLakeReturn · on Nov 19, 2020

If our society decides it is necessary to act with the full weight of the law behind it, then it would seem better to have the information available for the public to verify than not. I'm not saying it is all great, but that it is far better to have information available so that things like average sentence length for a given crime based on demographic and psychographic information can be queried by all. If a city that is 50/50 male/female and 20/80 black/non-black finds their speeding tickets are 70/30 male/female and 35/65 black/non-black, then it may be worth investigating to see if police are being fair about who they give warnings to, who gets reduced tickets, and who gets neither.
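As a sketch of that kind of check, using the hypothetical shares from the example (not real data): divide each group's share of tickets by its share of the population. A ratio well above 1 only flags something to look into; it does not by itself show unfair treatment.

```python
def representation_ratio(ticket_share, population_share):
    """Share of tickets divided by share of population (1.0 means proportional)."""
    return ticket_share / population_share


# Hypothetical shares taken from the example, not real statistics.
population = {"male": 0.50, "female": 0.50, "black": 0.20, "non-black": 0.80}
tickets = {"male": 0.70, "female": 0.30, "black": 0.35, "non-black": 0.65}

ratios = {
    group: representation_ratio(tickets[group], population[group])
    for group in population
}
for group, ratio in ratios.items():
    print(f"{group}: {ratio:.2f}")
```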

As for major privacy concerns, it is generally the more major crimes that have the larger issue with the victim being known. Knowing that someone was the victim of mischief vandalism is far less a privacy invasion than knowing they were the victim of sexual assault of a child (and even hiding the victim's identity often doesn't do more than hide the name from a passive search).

Then there are the benefits that other posters have raised, such as being useful for knowing past decisions used even in minor trials.

ghaff · on Nov 19, 2020

The general privacy issue that most jurisdictions have decided they just don't care that much about is that easy, indexed, free access to public records is different from the case where that same information is in a dusty file cabinet somewhere. There are a lot of things that people are, in principle, OK with being a matter of public record but are maybe less OK with their neighbor being able to casually discover it through Google.

distances · on Nov 19, 2020

Totally agree. I'd be all for open court records, requested in person, received in paper form against a small processing fee.

I do have a different cultural background so it's probably natural this feels horrible. Everything about this site would be so illegal in my home country it's almost hilarious in comparison. I'm used to (and fully approve of) a law that you can't keep a list of names in a notebook without a proper reason and everyone's consent; that would already be an illegal register.

ccostes · on Nov 19, 2020

So a Christmas card list would be illegal? That seems...excessive.

richardbarosky · on Nov 19, 2020

Good points.

If you look at the info page there is a specific example about how to look up codes of cases that had the same charge.

Being able to see how other offenders are sentenced is useful to make sure people are being treated fairly. Lawyers use this kind of data up to the point of producing analytics from data like that to understand outcomes. Major legal data companies have a large segment of business doing analytics for lawyers handling high and lower level cases.

Here are a few related links: https://cluesearch.org/ https://measuresforjustice.org/

bidnessmodell · on Nov 19, 2020

Worse, there's no obvious business model or disclosed funding source or institutional affiliation here.

That leaves me with the distinct impression that they're monetizing data about visitors and searches in some horrible way. (Data targeting for mugshot shakedown operations?)

I'm not going near this.

richardbarosky · on Nov 19, 2020

Maybe ads at some point.

Finnucane · on Nov 19, 2020

Sometimes even seemingly trivial cases can be caselaw precedent that people should be able to see and access without paying (they are public records).

richardbarosky · on Nov 19, 2020

Good point. PACER, in fact, has been called out by major news publications for literally being a scam in the way they charge for access to public records.

https://www.politico.com/magazine/story/2019/03/20/pacer-cou...

richardbarosky · on Nov 19, 2020

Only 3 pages are indexed on Google. Actually, most of the other legal databases (listed on info page) have their cases indexed on Google. However, judyrecords cases aren't indexed on Google. I understand your general sentiment.

iav · on Nov 20, 2020

How did you get google to index that many pages? For me Google only crawls about 1000 pages per day, no matter how many I show in the index

nautical · on Nov 19, 2020

Results like "MEETING ID" and "PASSWORD" for zoom meetings show up way more than any other video conferencing tool for 2020 cases.

nautical · on Nov 19, 2020

Many Zoom meetings are recurring and this might not be safe

programbreeding · on Nov 19, 2020

Looked at one record as an example and sure enough, the same meeting ID and password is found in 709 different cases in Cleveland, OH.

gfaure · on Nov 20, 2020

On a whim, I decided to search for "quicksort", and found a judgment where a loan company was trying to sue for infringement on the grounds that a competitor copied the SQL schema of their product. The complaint was upheld.

https://www.judyrecords.com/record/3nqi41qycaf9

lowercased · on Nov 20, 2020

wow...

"The Court finds that New Century had access to the SQL Data [pg. 536] Structures and that there is enough probative similarity to find that New Century factually copied the SQL Data Structures."

The next question might be to have 'Positive Software' demonstrate that they did not, in fact, take their table schemas from some place else. Like... textbooks? Or... example database schemas from vendors. Or tutorial sites? Or competing products?

There may be something extremely unique about part of their structure, perhaps, but... at the same time, there's often very little variety in how most similar data (crm/sales/lead gen/etc) might be stored to be remotely usable for reporting anyway.

"misappropriation of confidential information". Without seeing the structures in question it may be hard to say, but typically 'confidential info' is qualified with "not elsewhere available"-style clauses.

"... Likewise, the Court finds that there are more than one or a few ways to organize the data structures required for programs such as LoanTrack and LoanForce..."

Yeah, but usually there's only one good way to do stuff. Yes I could just have one row with 940 columns - technically, I could make my program work with that - but it's extremely suboptimal - regardless of whether I've seen anyone else's table structures or not.

ElijahLynn · on Nov 19, 2020

Interesting (obviously these aren't all the same person):

Page 1 of 1,763 total cases for: donald j. trump

Page 1 of 2,299 total cases for: donald trump

txmachinery · on Nov 19, 2020

It's 80 cases when searching: "donald j trump"~4

This is a proximity search, to ensure it's actually turning up one of the various permutations of the name (as different court protocols may refer by surname first), rather than documents that just happen to contain each of the terms somewhere.
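To illustrate the idea, here is a simplified Python sketch of a proximity check. It is not Lucene's actual slop algorithm (which measures how many position moves are needed to line the terms up as a phrase); it just asks whether all the terms occur, in any order, inside a window that is `slop` positions wider than the phrase itself. The function name and sample sentences are made up.

```python
import re
from itertools import product


def within_proximity(text, phrase, slop):
    """True if every phrase term occurs, in any order, within a small window."""
    tokens = re.findall(r"\w+", text.lower())
    terms = re.findall(r"\w+", phrase.lower())
    # Positions at which each term occurs in the text.
    term_positions = [
        [i for i, tok in enumerate(tokens) if tok == term] for term in terms
    ]
    if any(not positions for positions in term_positions):
        return False  # at least one term is missing entirely
    max_span = len(terms) - 1 + slop
    # Brute force over position combinations; fine for a sketch, slow on long texts.
    for combo in product(*term_positions):
        if len(set(combo)) == len(combo) and max(combo) - min(combo) <= max_span:
            return True
    return False


print(within_proximity("Trump, Donald J.", "donald j trump", 4))
print(within_proximity("Donald Smith and J Brown met a man named Trump", "donald j trump", 4))
```

The first call matches even though the surname comes first; the second does not, because the three terms are spread too far apart to be one name.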

For fairness, "hillary rodham clinton"~4 turns up 193 cases.

Relevant doc: https://www.judyrecords.com/info (down the page, under "proximity search")

FpUser · on Nov 19, 2020

I am not fond of exposing this kind of info. Don't we all have enough prying eyes?

richardbarosky · on Nov 19, 2020

I've mentioned other legal databases on the info page. It's public information. judyrecords is the largest free database of court cases, but there are many other free/not free ones as well.

FpUser · on Nov 19, 2020

I did not mean this one in particular. Just my opinion about the subject in general.

MeinBlutIstBlau · on Nov 19, 2020

In my state you can get some kind of understanding of what's going on, but the legalese is so vague that half the time you only know if someone got a speeding ticket, an underage charge, or a divorce.

1vuio0pswjnm7 · on Nov 20, 2020

Since session cookies are required, here is a simple script for judyrecords searching from command line. It uses links browser and tmux.

    #!/bin/sh

    # usage: 1.sh [query] -- perform search
    # usage: 1.sh -- process results page 1 to 200
    # usage: n=5 1.sh -- process results page 5 to 200
    # usage: n=201 1.sh -- quit

    # start tmux if not already running then detach

    j=https://www.judyrecords.com;
    case $# in
    1)
    tmux set set-remain-on-exit on;
    tmux neww links;
    tmux send g $j/addSearchJob?search="$@" c-m;
    sleep 1.5;
    tmux send d;
    tmux capturep -p|sed -n /./p;
    tmux send g $j/getSearchJobStatus c-m;
    sleep 1.5;
    tmux send d;
    tmux capture -p|sed -n /./p;
    ;;0)
    test $n||n=1;while true;do test $n -le 200||break;
    tmux send g $j/getSearchResults?page=$n c-m;
    sleep 2;
    tmux send Down Down '\' 
    # small monitor where results page HTML takes 4 spacebar presses to get to bottom; 
    m=0;while true;do test $m -le 3||break;
    # process results -- e.g., print record URLs;
    tmux capturep -p|sed -n "/href=..record/{s|.*record.|$j/record/|;s/\"//;p;}";
    tmux send Space;
    m=$((m+1));done;n=$((n+1));done;
    tmux killw;
    esac

1vuio0pswjnm7 · on Nov 23, 2020

Updated and improved

    #!/bin/sh

    j=https://www.judyrecords.com;
    case $# in
    1)
    tmux new -P -d links;
    tmux set set-remain-on-exit on;
    tmux send g;
    tmux send $j/addSearchJob?search="$@";
    tmux send c-m;
    sleep 1.7;
    tmux send d;
    tmux capturep -p|sed -n /./p;
    tmux send g;
    tmux send $j/getSearchJobStatus;
    tmux send c-m;
    sleep 1.7;
    tmux send d;
    tmux capture -p|sed -n /./p;
    ;;0)
    test $n||n=1;while true;do test $n -le 200||break;
    tmux send Down 
    tmux send Down 
    tmux send g 
    tmux send $j/getSearchResults?page=$n 
    tmux send c-m;
    sleep 2;
    tmux send Down 
    tmux send Down 
    tmux send Escape;
    tmux send F 
    tmux send v 
    tmux send c-u 
    tmux send 1.htm 
    tmux send c-m 
    tmux send o 
    sed -n "/href=\"\/record/{s,.*record\/,$j/record/,;s,\",,;p;}" 1.htm;
    __grepq=$(exec sed -n '/a class=\"goToNextPage/!d;=;q' 1.htm);
    test ${#__grepq} -gt 0||break;
    n=$((n+1));
    done;
    tmux killw ;
    esac

op03 · on Nov 20, 2020

Interesting. Haven't seen links being used in a while. Thanks for posting.

jaequery · on Nov 19, 2020

What a clean interface. We need more websites to look like this.

richardbarosky · on Nov 19, 2020

Thank you

nojvek · on Nov 20, 2020

Weapons of math destruction. This would be one of them. The data here is embarrassing for individuals and it can be looked up for decades in history.

I know this was always public but this makes it too easy for masses to dig through the troves.

Scares me. Next thing I see is some AR glasses that do facial recognition and correlate name -> public records. Could be a nasty blackmail tactic. Some things are close to Black Mirror in reality.

caseyscottmckay · on Nov 20, 2020

So the data should be public (as it always has been), but it should not be easily accessible?

nojvek · on Nov 22, 2020

No. Data should be erased from records after a couple of years for individuals with petty crimes.

pashabitz · on Nov 19, 2020

I wasn't trying to be an asshole, just honestly searched for "javascript". Was disappointed :)

richardbarosky · on Nov 19, 2020

Looking at the results, those all appear to be from CourtListener's bulk data.

Thorrez · on Nov 20, 2020

There are only 532 cases, so it's not too bad.

bflesch · on Nov 19, 2020

The search is very quick. Does anybody know what their tech stack looks like?

richardbarosky · on Nov 19, 2020

https://www.reddit.com/r/programming/comments/jg4rkv/how_a_s...

kordlessagain · on Nov 19, 2020

From Reddit thread:

> MySQL 8 is used for DB. The search server uses elasticsearch 7.8.

jillesvangurp · on Nov 19, 2020

Sounds like that would be an easy use case for elasticsearch indeed. I've seen it handle much bigger data sets. Solr would work as well. There are probably a few other options on the market but elasticsearch would probably do pretty well on this even without a lot of tuning.

For reference, I once threw the entirety of OpenStreetMap at it before it even hit 1.0 to implement a simple reverse geocoding thing. Basically a couple hundred million street segments, some polygons, etc. At the time the geospatial support wasn't great and very new and very CPU intensive. I got away with indexing all of that and running it on a single-node cluster with a Xeon and 32G of RAM and spinning disk (RAID 1, no SSD). It worked great. Very responsive. Indexing only took about 50 minutes or so. Most of that was my parsing logic. That's not comparable of course; I'd expect this to be faster on the same hardware with a current version of Elasticsearch. They've made a lot of leaps with improving performance, memory usage, CPU usage, disk usage, robustness, etc. in the 7 major versions since then.

caseyscottmckay · on Nov 19, 2020

This is courtlistener.com data correct?

richardbarosky · on Nov 19, 2020

From other comment: CourtListener has about 4 million opinions, which are included. On top of that, 435 million additional cases from throughout the US.

caseyscottmckay · on Nov 20, 2020

Interesting. Any stats/aggregations on numbers per resource type (e.g., 56k for scotus, 36k for D.C. Circuit Court etc)?

zaroth · on Nov 19, 2020

Great site, fast and simple.

When I type a query and press search, would like it if the URL updated with the search in the query string. It would make it easier to share specific queries.

richardbarosky · on Nov 19, 2020

Good point, that would definitely be an improvement. Thank you.

powerbook5300CS · on Nov 19, 2020

This is awesome. Where did the data come from?

richardbarosky · on Nov 19, 2020

Thanks. All the data is collected from various government databases.

aVx1uyD5pYWW · on Nov 20, 2020

How did you do that? Did you have to implement a scraper for each county?

x87678r · on Nov 19, 2020

This is where it's nice to have a common name. Honestly it's worth changing your first and last name to something generic.

ikeboy · on Nov 19, 2020

I don't see a breakdown by source. What does this have that courtlistener doesn't, for example?

richardbarosky · on Nov 19, 2020

CourtListener has about 4 million opinions, which are included. On top of that, 435 million additional cases from throughout the US.

ikeboy · on Nov 19, 2020

Where are they getting public domain opinions that CL doesn't have? Are these states or counties that CL doesn't scrape? It would be nice to have a breakdown by jurisdiction.

Also, by "case" do you mean "opinions"?

Full disclosure, I've written and contributed to several scrapers for CL, and if there's a large source they're missing I'd like to know.

Note that the CL opinion number you're quoting doesn't include orders from Federal courts that are in the RECAP collection, which accounts for several million additional opinions.

tomorrowfuture · on Nov 19, 2020

trellis.law does something similar

their searches are indexed and have rulings and documents as well.

does this differ from that service?

chris_f · on Nov 19, 2020

Congratulations on the launch. I have worked in open source and public record research for the last 15 years, and your coverage is extremely impressive.

Do you have any long term plan for the site? I can see this going in a lot of different directions depending on your goals.

richardbarosky · on Nov 19, 2020

Thanks, as far as I know it's the largest database of court cases on the Internet. If there's enough traffic I'll support the site with ads. Don't have any other specific plans currently.

iav · on Nov 20, 2020

I run a similar free site and was looking at adding ads. Google Adsense rejected it for not complying with their program policies. My data is on large US federal bankruptcies, so I really couldn’t pinpoint why, but just a heads up that it might be more difficult.

vmception · on Nov 19, 2020

Fascinating! Was surprised to see random infractions

Does this have the lower trial court records too?

richardbarosky · on Nov 19, 2020

Yes, it has records from different trial courts.

vmception · on Nov 19, 2020

Is there a list of jurisdictions and courts that it has?

squid_demon · on Nov 20, 2020

Glad the record I got expunged from 20+ years ago is in there. Really nice.

nkw · on Nov 19, 2020

You might consider giving credit to the sources of data used to make this.

richardbarosky · on Nov 19, 2020

All the data is from government databases directly, aside from CourtListener, which was recently integrated. It would be good to specifically mention CourtListener's contribution.

aschatten · on Nov 19, 2020

How did you get all that data from government databases directly? Do they provide some sort of API for bulk export?

wtvanhest · on Nov 19, 2020

It is all public records. The source of the original data is the court system. If a 3rd party physically scraped it from the court system, others should be able to digitally scrape it.

kordlessagain · on Nov 19, 2020

It's throwing a 500 for some regex I fed it.

richardbarosky · on Nov 19, 2020

I recently added advanced query support. Looks like I need to clean up some validation. Thanks.
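A minimal sketch of the kind of validation that could turn that 500 into a client error. It uses Python's `re` module as a stand-in, since the site's actual search engine and regex dialect aren't stated here; `validate_regex` and `MAX_PATTERN_LENGTH` are hypothetical names, not anything from the site.

```python
import re

# Assumed upper bound on pattern length; the real limit (if any) is unknown.
MAX_PATTERN_LENGTH = 200


def validate_regex(pattern: str) -> tuple[int, str]:
    """Return an HTTP-style (status, message) pair for a user-supplied pattern.

    Malformed input is reported as a 400 (client error) instead of being
    allowed to raise inside the search handler and surface as a 500.
    """
    if not pattern:
        return 400, "empty pattern"
    if len(pattern) > MAX_PATTERN_LENGTH:
        return 400, "pattern too long"
    try:
        # re.compile raises re.error on syntactically invalid patterns.
        re.compile(pattern)
    except re.error as exc:
        return 400, f"invalid regex: {exc}"
    return 200, "ok"


if __name__ == "__main__":
    print(validate_regex("prop(osition)? 13"))
    print(validate_regex("(unclosed"))
```

Checking the pattern before it reaches the search backend keeps malformed input from showing up as a server fault. A syntax check alone doesn't guard against slow, pathological patterns, so a length cap or a query timeout would still be worth having.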

rcpt · on Nov 20, 2020

This is great. I had no idea how many times people tried to end Prop 13 using the courts or how litigious HJTA was.

Hard to read on mobile, though.

leonardoeloy · on Nov 20, 2020

Search for "analytics" and then you realize that there is a collection company that sues the hell out of people.

visarga · on Nov 19, 2020

This dataset would be yummy for GPT3

richardbarosky · on Nov 19, 2020

Interesting, I'll check it out. Thanks for the link.

ergwwrt · on Nov 19, 2020

Didn't Aaron Swartz try to do this but couldn't because it costs $0.10 per page?

richardbarosky · on Nov 19, 2020

From what I understand, he had some kind of academic library access for PACER and used that to bypass the fees others would be charged. There are lawsuits against PACER for charging fees for what's public information generated by taxpayer money. He ended up being charged with various crimes related to maybe computer fraud and eventually committed suicide. A very sad story.

pachico · on Nov 19, 2020

Jeez, 36 pages of results for "Napster"!

wil421 · on Nov 20, 2020

Wow a seat belt violation from 2004!

m3kw9 · on Nov 19, 2020

What stack did you use for this?

richardbarosky · on Nov 19, 2020

See other comment link to reddit.

vmception · on Nov 19, 2020

lol so many records that should have been destroyed and not indexable!

so do I get a court order for each county, the website, the resyndicating source that the website uses or what?

I looked at the reddit page and other people noticed the same thing, the author just said send me the link! Hahaha one by one removal maybe!

Shut it down, enjoy it while it lasts

richardbarosky · on Nov 19, 2020

I don't think you understand what you're talking about. There are many databases that are made up of public records. Many aren't free, some are.

vmception · on Nov 19, 2020

That may be the reality but if the court or due process ordered something expunged from a record it should be updated in all records and the details not present.

Should just do a search for expunged or similar terms and remove those entries.
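A rough sketch of that keyword pass, assuming each record is a dict with a free-text `text` field (the site's real schema isn't known, and `SEALED_TERMS` and `drop_flagged` are made-up names):

```python
import re

# Terms that suggest a record was expunged or sealed. This list is an
# assumption for illustration; real dockets may use other wording.
SEALED_TERMS = re.compile(r"\b(expunged?|expungement|sealed)\b", re.IGNORECASE)


def drop_flagged(records):
    """Return only the records whose text does not mention a sealed/expunged term."""
    return [r for r in records if not SEALED_TERMS.search(r.get("text", ""))]


if __name__ == "__main__":
    sample = [
        {"id": 1, "text": "Speeding ticket, fine paid"},
        {"id": 2, "text": "Juvenile case - SEALED by court order"},
        {"id": 3, "text": "Record expunged 2004"},
        {"id": 4, "text": "Motion to have documents unsealed"},
    ]
    print([r["id"] for r in drop_flagged(sample)])
```

It's crude: it only catches records whose own text carries such a note, and it will also drop unrelated uses of the word "sealed". Anything expunged at the source without a notation would still slip through.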

chrisseaton · on Nov 19, 2020

> lol so many records that should have been destroyed and not indexable!

You want secret courts?

jschwartzi · on Nov 19, 2020

Do you want things that children do to follow them for the rest of their lives?

tshaddox · on Nov 19, 2020

Well, no. Are there names of minors in this database? I thought the US had a mechanism to prevent that, or at least to petition to have records of minors removed or anonymized.

lazyasciiart · on Nov 19, 2020

Yes. The mechanisms are shit. Many of these cases are juvenile cases with a note saying the case is sealed, along with full details of the charge, name, and outcome.

Edit: wow, plus family court stuff like a four year custody dispute, kids being adopted, etc

vmception · on Nov 19, 2020

"The US" has 39,044 distinct local governments and municipalities and they all do their procedural nuances differently and to varying efficacy and different points in time! :D

chrisseaton · on Nov 19, 2020

I don't know what culture you come from, but in the US and UK and similarly influenced cultures justice being seen to be done and recorded is a pretty important principle and mechanism against overreach of the state.

dadrock · on Nov 19, 2020

I agree. But it's worth noting that the UK has recently enacted the Right to be Forgotten Law, which plays into this discussion.

ghaff · on Nov 19, 2020

Of course, once data is replicated and distributed around, it's very hard to put the genie back into the bottle.

lazyasciiart · on Nov 19, 2020

There are significant limits to that, such as juvenile courts.

matz1 · on Nov 19, 2020

The following itself is not the issue right?

vmception · on Nov 19, 2020

they weren't secret and were available for public perusal and judgement until the designated time

secret courts have cases that are secret from the beginning

aVx1uyD5pYWW · on Nov 20, 2020

how was this data acquired? did you scrape government websites for this data?

justinzollars · on Nov 19, 2020

This is great. It's like Google before it became evil.

richardbarosky · on Nov 19, 2020

Thank you

xyst · on Nov 20, 2020

hahaha - just found dirt on like 1/4 of the people I know.

whoaWtf · on Nov 19, 2020

[flagged]

josefresco · on Nov 19, 2020

Wow, I really wish I hadn't jumped down that rabbit hole.

richardbarosky · on Nov 19, 2020

Feel free to join in if you're interested in gender dynamics stuff

kontxt · on Nov 19, 2020

Super cool--and very fast! Anyone looking to collaborate on these can easily add Kontxt (https://www.kontxt.io) right on to them and have localized discussions directly on page-parts.

richardbarosky · on Nov 19, 2020

Thanks. I saw your post on reddit a while back. Was going to ask about your tech stack.

kontxt · on Nov 20, 2020

I used React client-side, Node server-side, and MySQL as the db. I only mentioned Kontxt here because I demoed it for Thomson Reuters as a collaboration tool their legal professionals could use after they find documents via their Westlaw legal search product, and your tool reminded me of it. I actually used Kontxt as a sales pitch to highlight their annual report and add some calculations and explanations about how much money they could make. Nice work, again!