Hacker News | chaps's comments

For my workflows, layout extraction has been so inconsistent that I've stopped attempting to use it. It's simpler to just throw everything into postgis and run intersection checks on size-normalized pages.
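A minimal sketch of the size-normalization and intersection idea, in plain Python rather than PostGIS (in PostGIS the overlap test would be a geometry function such as ST_Intersects). The `Box` type, the function names, and the "overlap divided by the smaller box" measure are assumptions made for illustration, not the actual setup described above:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box: (x0, y0) is top-left, (x1, y1) is bottom-right."""
    x0: float
    y0: float
    x1: float
    y1: float


def normalize(box: Box, page_w: float, page_h: float) -> Box:
    """Scale a box onto a 0..1 page so pages of different sizes compare."""
    return Box(box.x0 / page_w, box.y0 / page_h, box.x1 / page_w, box.y1 / page_h)


def overlap_ratio(a: Box, b: Box) -> float:
    """Intersection area divided by the smaller box's area (0.0 if disjoint)."""
    w = min(a.x1, b.x1) - max(a.x0, b.x0)
    h = min(a.y1, b.y1) - max(a.y0, b.y0)
    if w <= 0 or h <= 0:
        return 0.0
    smaller = min((a.x1 - a.x0) * (a.y1 - a.y0), (b.x1 - b.x0) * (b.y1 - b.y0))
    return (w * h) / smaller if smaller > 0 else 0.0
```

A text box from a scan at one resolution can then be compared against a known form field from another scan, as long as both are normalized by their own page size first.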


Interesting. What kind of layout do you have?

My documents have one or two-column layouts, often inconsistently across pages or even within a page (which tripped older layout detection methods). Most models seem to understand that well enough so they are good enough for my use case.


Documents that come from FOIA. So, some scanned, some not. Lots of forms and lots of hand writing to add info that the form format doesn't recognize. Lots of repeated documents, but lots of one-off documents that have high signal.


I'd be very curious what works well with historical FOIA documents that have been scanned by hand and redacted with markers, etc.


I like to use textual anchors for things like "line starts with" or "line ends with" or "file ends with", and combine that with Levenshtein distance plus some normalization stuff (combining adjacent strings in various patterns to account for OCR wonkiness). It turns into building lists of anchors that can be built off of. Of all the things I've tried, including things like image hashing and such, it's been the most effective generalized "tool".
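A rough sketch of a fuzzy "line starts with" anchor. Everything here is an assumption for illustration: the function names, the 0.2 tolerance, and the normalization (lowercasing and dropping everything except letters and digits, which is a crude stand-in for combining adjacent strings split by stray OCR spaces):

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def normalize(text: str) -> str:
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def line_starts_with(line: str, anchor: str, max_ratio: float = 0.2) -> bool:
    """True if the start of `line` is within `max_ratio` edits of `anchor`."""
    want = normalize(anchor)
    if not want:
        return False
    got = normalize(line)[: len(want)]
    return levenshtein(got, want) / len(want) <= max_ratio
```

With this, an OCR'd line like "D ate of B1rth: 01/02/1990" still matches the anchor "Date of Birth", while an unrelated line does not. "Line ends with" would be the same check on the tail of the normalized line.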

But also, I hold the strong philosophy that it's important to actually read the documents that are being scanned. In that way, OCR tends to be more of a procedural step than anything.

Really, it ultimately depends on your goals.


When Tesseract v4 was released, it was exceptionally good and blew everything else out of the water. I have used it to OCR millions of pages. Tbh, I miss the simplicity of Tesseract.

The new models are similarly better compared to Tesseract v4. But what I'll say is: don't expect new models to be a panacea for your OCR problems. The edge case problems that you might be trying to solve (like identifying anchor points, or identifying shared field names across documents) are pretty much all still problematic. So you should still expect things like random spaces or unexpected characters to jam up your jams.

Also, some newer models tend to hallucinate incredibly aggressively. If you've ever seen an LLM get stuck in an infinite loop, think of that.


I used Tesseract v3 back in the day in combination with some custom layout parsing code. It ended up working quite well. When looking at many of the models coming out today the lack of accuracy scares me.


And I'm getting tired of these comments that normalize the awfulness of the past. We can be pragmatic in recognizing that "our guys" also did bad things. Less bad than awful is still bad. If we choose not to recognize our own foibles then we just fall down our old patterns of "it's someone else's problem".

Because otherwise, better than what we have now is an abysmal target and we should aim for better.


> We can be pragmatic in recognizing that "our guys" also did bad things

What do you mean "our guys"? I don't have guys. I consider myself a libertarian, was both-sidesing up until June of 2020, and had never voted for a major party in a national election until 2020 when I voted for Biden - which I view as me getting older and more conservative - aka valuing our societal institutions and values after seeing how much Trump openly trashed them instead of showing an ounce of leadership during Covid.

Even with this perspective, I still think it is foolish to write off the current administration as if it's just another iteration of back and forth corruption rather than a shameless wholesale kicking over of the apple cart.


> I still think it is foolish to write off the current administration as if it's just another iteration of back and forth corruption

You are deeply, deeply misunderstanding my point if this is what you got from my post.

"Our guys" was tongue in cheek.


Care to elaborate on your point then? Reading what you have written, I do agree with the abstract thrust of where you're coming from.

But I have also observed that the destructionists appeal to similar lofty ideas to justify what is currently going on - eg accelerationism.

(I also don't know what difference "tongue in cheek" makes. I've never looked at the government and thought anything like "these people represent me and work for my interest". I know a lot of people seemingly have, but that's not me. But I did look at the Biden administration, which I voted for, and think "this is the stable predictable evil I (and the rest of American society) already know how to cope with".)


Well, for context, my background is in investigative journalism with a focus on policing, technology, and transparency. I've been the plaintiff in a bit over 10 FOIA lawsuits and have three ongoing suits now. "Our guys" was meant more as a hand-waved ideal of what each in-person thinks their out-person is.

My point can be read as a recognition that ratcheting happens within the boring minutiae and work is rarely done to recover from those ratchetings. Things like continuation of prosecution policies, legislation changes, staff changes, etc. There's a very strong tendency to consider those sorts of ratcheting effects as "just how things are" rather than recognizing that no, it hasn't always been that way.

Like, progressive politicians love to talk a big game about transparency, but when it comes down to it, they themselves contribute to systemic transparency failures. See the campaign transparency promises of Chicago's past two mayors. Both have done a complete 180 on those promises and use never-losing lawyers to enforce that sort of thing. Chicago's mayor's office once asked me to do an analysis of parking tickets' effects on poor folk... then a few months later accused me of wanting a data dictionary of the parking ticket system so that I could modify the system's data. That led to bad case law at the Illinois Supreme Court.

It's shit like that. The small-but-not-really-small things.


Like another poster said, this is very well known already. It's one of the reasons why municipalities purchase this data from data brokers.


Is that not still incredibly vulnerable to timing attacks?


Maybe I’m mis-interpreting what you mean, but without a notification when a message is sent, what would you correlate a message-received notification with?


Unfortunately a lot of investigations start out as speculation/vibes before they turn into an actual evaluation. And getting past speculation/vibes can take a lot of effort and political/social/professional capital before even starting.


Well yeah. If they had solid evidence at the start, why would they need an investigation?


It's not as obvious of an answer as it initially sounds. I'm coming at this from a stint in investigative journalism, where even beginning an investigation requires getting grants, and grants involve convincing other people that the money is going to good use. Also, having been told by multiple editors that an investigation I ran was nothing, when it turned out to be something big... it really shifted how I perceive investigations and what it means to stick your neck out when everyone's telling you that something isn't happening even when it is.


This is definitely not a context problem. Very simple things like checking for running processes and killing the correct one are something that models like opus 4.5 can't do consistently... instead of recognizing that it needs to systematize that sort of thing -- one and done. Like, probably 50% of the time it kills the wrong thing. About 25% of the time after that it recognizes that it didn't kill the correct thing and then rewrites the ps or lsof from scratch and has the problem again. Then if I kill the process myself out of frustration it checks to see if the process is running, sees that it's not, then gets confused and sets its new task to rewrite the ps or lsof... again. It does the same thing with tests, where it decides to just, without any doubt in its rock brain, delete the test and replace it with a print statement.


Having done app support across many environments, um - yes, multiple microservices is usually pretty simple. Just look at the open file/network handles and go from there. It's absolutely maddening to watch these models flail in trying to do something as basic as "check if the port is open" or "check if the process is running... and don't kill firefox this time".

These aren't challenging things to do for an experienced human at all. But it's such a huge pain point for these models! It's hard for me to wrap my head around how these models can write surprisingly excellent code but fall down in these sorts of relatively simple troubleshooting paths.
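For reference, the "check if the port is open" step really is this small. A minimal sketch, assuming a TCP port on an IPv4 address; `port_is_open` is a made-up helper name, and a hostname that fails to resolve will still raise:

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """True if a TCP connection to (host, port) succeeds within `timeout`."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise,
        # instead of raising for ordinary connection failures.
        return sock.connect_ex((host, port)) == 0
```

This only says that something is listening on the port, not which process owns it. Mapping a port to a process is where lsof or similar comes in.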


They have code in training data, and you have e.g. git where you can see how the code evolved, and they can train on PR reviews and comments.

There isn't much posted in the way of "bash history and terminal output of successful sysadminning" on the web.


I'm not sure that finding and killing the correct process is something I'd consider to be a "sysadmin task". That's something you learn on the first day of just about any Linux course/primer and there are many examples of its use online.

It's more that the default is to overuse tools that cast too-wide nets, like pgrep and pkill. And the model doesn't know how to use the output well enough. Like, when these systems run ps, they pick random processes from the list instead of identifying the most recent process that they, themselves, started.

It's as if some SRE-type person decided to hard code pgrep and pkill because it's their personal preference.
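A sketch of the alternative: keep the handle of the process you started and stop exactly that one, rather than pattern-matching over everything that is running. The helper names and the 5-second grace period are assumptions for illustration, and the sleeping child is a hypothetical stand-in for a dev server:

```python
import subprocess
import sys


def start_tracked(cmd: list[str]) -> subprocess.Popen:
    """Start a process and keep the handle, so its PID is known exactly."""
    return subprocess.Popen(cmd)


def stop_tracked(proc: subprocess.Popen, grace: float = 5.0) -> int:
    """Stop only the process behind `proc`: terminate first, kill if ignored."""
    if proc.poll() is None:  # still running
        proc.terminate()
        try:
            proc.wait(timeout=grace)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()
    return proc.returncode


if __name__ == "__main__":
    # Hypothetical long-running child; stands in for a dev server.
    child = start_tracked([sys.executable, "-c", "import time; time.sleep(60)"])
    print("started pid", child.pid)
    print("exit code", stop_tracked(child))
```

Nothing here greps a process list, so there is no way to hit firefox by accident. The same idea works in a shell by recording `$!` right after launching the background job.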


Really glad they brought up Outer Wilds -- it's exactly the sort of game where the tiniest detail is a spoiler. Knowledge discovery's the game, so any piece of information about the game that doesn't need to be discovered is like cutting ahead to the next chapter in a game. Like playing on someone else's game file.

Wish someone would wipe my memories of that game so I can play it again.


> Wish someone would wipe my memories of that game so I can play it again.

Felt the same for years, now I am doing a new playthrough.

I figured, of course I know the solution to the puzzle, but I am hard pressed to remember all the details of how I uncovered that answer, and I know that you can uncover the clues in nearly any order so I know this playthrough will be new in its own way.

And I miss the world, and the gameplay.


When talking to someone at risk of deportation earlier in the year, they asked me, "Why should I do anything differently? Obama and Biden did the same exact shit."

And there's a lot of truth to that which a lot of people need to reconcile with.

The fact that we don't have DACA solidified into a path towards citizenship by now is just sad.


And I agree with you, but that's not what I'm questioning. Given the 10x larger scale of deportations during Obama's term, why were there no protests?


During Obama's term the practice of warrantless entry into actual citizens' homes wasn't widespread.

During Obama's term the leaders of DHS / ICE were not blatantly lying about events captured on film and evading legitimate investigations into deaths at the hands of officers.

During Obama's term people with no criminal record were not being offshored to hell-hole prison camps with serious abuses of human rights.



Can you link to the tweet in which Obama defended the agents' right to threaten a child with rape?

From your linked article:

  If the abuses were this bad under Obama when the Border Patrol described itself as constrained, imagine how it must be now under Trump, who vowed to unleash the agents to do their jobs.

There's your difference. Thank you for playing.


The core issue is the media. I worked at a large news company in New York during Obama's term. There was a training for our reporters: anything negative about Obama was strictly prohibited. Ad revenue.


As many others have pointed out, the deeper issue is the size of the boot, the disregard for citizens' rights, the extremes of the offshore gulags, the fervor with which the upper levels embrace the brutality.

I am unable to assist further with your stated struggle for comprehension.


Not to add fuel to the fire, but a lot of what you're saying is hard to take seriously when Obama himself has been known to brag about how good at killing he is.

You're right that things are significantly worse now, but it's important to recognize that what came before was still bad and in many ways is the foundation for where we are.

https://slate.com/business/2013/11/double-down-obama-said-he...


Thanks for the response, I'm happy to engage, although I almost missed this as you're well over the fold in my comment history and I have no mechanism for alerting me to replies (nor, I might add, am I looking for one).

With the preamble that I'm not a US citizen, have never thought to apply to be one, have been in and out of the US and many other countries a number of times, and don't play favourites with POTUS(n) on the basis of their asserted party ticket; ...

The upstream question and context here concerns differences between administrations wrt home soil immigration policy, on which I've been focused.

As points of note:

* Allegations of POTUS(X) boasting behind doors are a difference of behaviour from that of POTUS(Y) coming right out and stating they can freely kill in Times Square and get away with it, while glorifying the deaths of citizens in public and promising perpetrators they'll get away with it and have immunity.

* I'm no fan of remote double tap kills. Full stop. That said;

* POTUS(X) authorising kills in an "inherited" known and ongoing "war zone" known to all is distinct from POTUS(Y) authorising double tap kills from unmarked airframes of civilians in international waters prior to any declaration of war (via Congress or not).

* Regardless, the offshore behaviour of any POTUS is distinct from their behaviour toward their own citizens within their country.

In the arc of all the shitty behaviour by post WW POTUS(n) candidates, the current incumbent has significantly levelled up to achieve Kissinger level disregard for human life on home soil for purely political gain .. and played that hand badly.

That aside, I'm not a Communist - but I do admire Ash Sarkar's shut down of idiotic Obama / Trump faux dichotomy posings by a pompous right wing media pundit - https://www.youtube.com/watch?v=JD7Ol0gz11k

I equally admire our PM's "off the cuff" (approximately 15 mins rough note prep time) strip down of an opposition one-time PM attempting to pin a third party's bad behaviour on the sitting government on the basis of them making no comment until after a Court case had completed (as per the law here) - https://www.youtube.com/watch?v=fCNuPcf8L00

It's not relevant to immigration policy, but it is a good example of off the cuff professional level political debate in sitting government.

