barthelomew's comments

You can get bonuses though. The optimization algorithms do not get bonuses.


Paper LaTeX files often contain surprising details. When a paper lacks code, looking at the LaTeX source has become part of my reproduction workflow. The comments often reveal non-trivial insights; frequently they contain a simpler version of the methodology section (which is deliberately obscured with mathematical jargon for the sake of apparent "novelty").
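For what it's worth, a minimal sketch of that part of the workflow, under my own tooling assumptions (requests for HTTP, and the fact that arXiv serves e-prints as gzipped tarballs or single gzipped TeX files):

    import io
    import gzip
    import tarfile
    import requests

    def latex_comments(arxiv_id: str):
        """Fetch an arXiv e-print and yield its LaTeX comment lines (best effort)."""
        raw = requests.get(f"https://arxiv.org/e-print/{arxiv_id}", timeout=30).content
        try:
            # Most e-prints are gzipped tarballs of the full source tree.
            with tarfile.open(fileobj=io.BytesIO(raw), mode="r:*") as tar:
                texts = [
                    tar.extractfile(m).read().decode("utf-8", errors="ignore")
                    for m in tar.getmembers()
                    if m.name.endswith(".tex") and m.isfile()
                ]
        except tarfile.ReadError:
            # Some are a single gzipped .tex file.
            texts = [gzip.decompress(raw).decode("utf-8", errors="ignore")]
        for text in texts:
            for line in text.splitlines():
                stripped = line.lstrip()
                if stripped.startswith("%") and len(stripped) > 1:
                    yield stripped

    # for comment in latex_comments("2409.17270"):
    #     print(comment)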


Reading the LaTeX equations also makes for easier (LLM) translation into code, rather than trying to read the PDF.


Yep! Datalog syntax for Z3 is pretty neat! We used SMT [1] in our grammars paper because it allowed the most interoperability with solvers, but our technique also works with Prolog, as tested at the behest of reviewers at NeurIPS. I would assume it should also work with Datalog [2] (a toy sketch of Z3's Datalog engine follows the references).

[1] https://arxiv.org/abs/2505.20047 [2] https://github.com/antlr/grammars-v4/blob/master/datalog/dat...
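
To make the Datalog angle concrete, here is a toy, self-contained example of Z3's Fixedpoint engine in Datalog mode (a transitive-closure query). This is only an illustration of the engine, not the encoding used in the paper:

    from z3 import BitVecSort, BitVecVal, BoolSort, Consts, Fixedpoint, Function

    fp = Fixedpoint()
    fp.set(engine="datalog")

    s = BitVecSort(8)
    edge = Function("edge", s, s, BoolSort())
    path = Function("path", s, s, BoolSort())
    a, b, c = Consts("a b c", s)

    fp.register_relation(edge, path)
    fp.declare_var(a, b, c)

    # path is the transitive closure of edge
    fp.rule(path(a, b), edge(a, b))
    fp.rule(path(a, c), [edge(a, b), path(b, c)])

    n1, n2, n3 = (BitVecVal(i, s) for i in (1, 2, 3))
    fp.fact(edge(n1, n2))
    fp.fact(edge(n2, n3))

    print(fp.query(path(n1, n3)))  # sat: 1 reaches 3
    print(fp.query(path(n3, n1)))  # unsat: 3 does not reach 1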


Hey! Thank you for the interest! I shall do that. Meanwhile, check out Page 11 onwards. We describe a lot of situations! (https://arxiv.org/pdf/2409.17270)


Upvoting the comment that the Git repo would be much more self-contained if it had an intro to the DSL.


Yep, this is a genuine problem, and it is what we term the autoformalization gap in our follow-up paper. (https://arxiv.org/abs/2505.20047)

Some LLMs are more consistent between text and SMT than others (Table 1, Figures 14 and 15).

You can do uncertainty quantification with selective verification to reduce the "risk", e.g. reported as the Area Under the Risk-Coverage Curve in Table 4.
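
For context, the risk-coverage construction is standard in selective prediction; a rough sketch with my own variable names (not the paper's code):

    import numpy as np

    def risk_coverage(confidences, correct):
        """Sort by confidence (most confident first); at each coverage level,
        risk is the error rate among the examples kept so far."""
        conf = np.asarray(confidences, dtype=float)
        errors = 1.0 - np.asarray(correct, dtype=float)
        errors = errors[np.argsort(-conf)]

        kept = np.arange(1, len(errors) + 1)
        coverage = kept / len(errors)
        risk = np.cumsum(errors) / kept
        aurc = risk.mean()  # discrete area under the risk-coverage curve
        return coverage, risk, aurc

    # Toy example: the high-confidence answers happen to be the correct ones,
    # so risk stays at 0 until the low-confidence answers are admitted.
    cov, risk, aurc = risk_coverage(
        confidences=[0.95, 0.90, 0.80, 0.60, 0.40],
        correct=[1, 1, 1, 0, 0],
    )
    print(aurc)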


Well, if you understand that this is a "genuine problem", then what have you done to solve it? A quick look at the abstract of your follow-up paper does not reveal an answer.

And let me be clear that this is a major limitation that fundamentally breaks whatever you are trying to achieve. You start with some LLM-generated text that is, by construction, unrelated to any notion of truth or factuality, and you push it through a verifier. Now you are verifying hot air.

It's like research into the efficacy of homeopathic medicine, of which there is indeed a lot, very carefully performed and with great attention to detail. Except all of that research is trying to prove whether doing nothing at all (i.e. homeopathy) has some kind of measurable effect or not. Obviously the answer is no. So what can change that? Only making homeopathy do something instead of nothing. But that's impossible, because homeopathy is, by construction, doing nothing.

It's the same thing with LLMs. Unless you find a way to make an LLM that can generate text conditioned on some measure of factuality, you can verify the output all you like; the whole thing will remain meaningless.


Yep. The paper was written last year with GPT-4o. Things have become a lot better since then with newer models.

E.g., in Table 1 of https://arxiv.org/pdf/2505.20047 we compare performance on text-only vs. SMT-only reasoning: o3-mini does pretty well at mirroring its text reasoning in its SMT, compared to Gemini Flash 2.0.

This is illustrated in Figures 14 and 15 on page 29.

In commercially available products like AWS Automated Reasoning Checks, you build a model of your domain (e.g. from a PDF policy document), cross-verify it for correctness, and during answer generation you only check whether the Q/A pairs from the LLM comply with the policy, using a solver with guarantees.

This means they can give you a 99%+ soundness guarantee: if the service says a Q/A pair is valid or guaranteed w.r.t. the policy, it is right more than 99% of the time.

https://aws.amazon.com/blogs/aws/minimize-ai-hallucinations-...
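
For the curious, these checks are surfaced through Bedrock Guardrails. Below is a rough boto3 sketch of what invoking one could look like; the ApplyGuardrail call is real, but the guardrail ID/version are placeholders and the way I map the Q/A pair onto the content qualifiers is my own assumption, so treat this as a sketch rather than a reference:

    import boto3

    # Assumes a guardrail with an Automated Reasoning policy (e.g. built from a
    # PDF policy document) has already been created and versioned.
    client = boto3.client("bedrock-runtime", region_name="us-east-1")

    question = "Can I get a refund 45 days after purchase?"
    llm_answer = "Yes, refunds are available up to 60 days after purchase."

    response = client.apply_guardrail(
        guardrailIdentifier="your-guardrail-id",   # placeholder
        guardrailVersion="1",                      # placeholder
        source="OUTPUT",
        content=[
            # Qualifier mapping below is an assumption, not taken from the docs.
            {"text": {"text": question, "qualifiers": ["query"]}},
            {"text": {"text": llm_answer, "qualifiers": ["guard_content"]}},
        ],
    )

    # The assessments indicate whether the answer is valid/invalid (or cannot
    # be determined) with respect to the formalized policy.
    print(response.get("action"))
    print(response.get("assessments"))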


Re: "99% of the time" ... this is an ambiguous sample space. Soundness of results clearly depends on the questions being asked. For what set of questions does the 99% guarantee hold?


Who makes the rules?


Hey, yes! This is because the DSL (Domain-Specific Language) is pretty complex, and the LLM finds it hard. We prototype a much more effective version using SMT in our NeurIPS 2025 paper (https://arxiv.org/abs/2505.20047). We shall open-source that code soon!


Hey there! I designed and wrote most of the actual interpreter during my internship at Microsoft Research last summer. Constrained decoding for GPT-4 wasn't available when we started designing the DSL, and besides, writing a regex to constrain this specific DSL is quite challenging.

When the grammar of the language is better defined, like SMT (https://arxiv.org/abs/2505.20047), we are able to do this with open-source LLMs.
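
As a toy illustration of what grammar-constrained decoding does in general (a character-level sketch of the idea, not the paper's system or any particular library): at each step, mask out every candidate token that the grammar cannot extend into a valid sentence, then pick from the rest.

    import math
    import random

    # Toy DSL: a single comparison such as "x>0" or "y<1".
    def allowed_next(prefix: str) -> set:
        """Characters that keep `prefix` extendable to a valid sentence."""
        transitions = [set("xy"), set("<>"), set("01")]  # variable, operator, digit
        return transitions[len(prefix)] if len(prefix) < len(transitions) else set()

    def fake_lm_scores(prefix: str) -> dict:
        """Stand-in for a language model: arbitrary scores over a character vocab."""
        rng = random.Random(len(prefix))  # deterministic for the demo
        return {ch: rng.uniform(-1, 1) for ch in "xyab<>=!0123 "}

    def constrained_decode() -> str:
        out = ""
        while True:
            legal = allowed_next(out)
            if not legal:          # the sentence is complete
                return out
            scores = fake_lm_scores(out)
            # Grammar-constrained step: only consider tokens the grammar allows,
            # then pick the best remaining one (greedy, for simplicity).
            out += max(legal, key=lambda ch: scores.get(ch, -math.inf))

    print(constrained_decode())    # always a well-formed comparison, e.g. "x>0"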


What are you talking about? OpenAI has supported structured JSON output in the API since 2023. Only the current structured-output API was introduced by OpenAI in summer 2024, and it was primarily a usability improvement that still runs JSON behind the scenes.


You're right about the 2023 JSON mode, but our project required enforcing a much more complex DSL grammar (see the Appendix for details), not just ensuring a *valid JSON object*. The newer structured-output APIs are a significant improvement, but the earlier tools weren't a fit for the specific constraints we were working under at the time.
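
For context, a minimal sketch contrasting the two options being discussed, 2023-era JSON mode versus the 2024 structured-outputs API; the model name and schema here are placeholders of my own:

    from openai import OpenAI

    client = OpenAI()

    # 2023-era JSON mode: guarantees syntactically valid JSON, says nothing
    # about its shape (and requires mentioning JSON in the prompt).
    json_mode = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{"role": "user", "content": "Reply in JSON with a verdict and a reason."}],
        response_format={"type": "json_object"},
    )

    # 2024 structured outputs: the response must conform to a declared JSON
    # schema -- closer to grammar enforcement, but still limited to JSON shapes,
    # not an arbitrary DSL grammar.
    structured = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Reply in JSON with a verdict and a reason."}],
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "verdict",
                "strict": True,
                "schema": {
                    "type": "object",
                    "properties": {
                        "verdict": {"type": "string", "enum": ["valid", "invalid"]},
                        "reason": {"type": "string"},
                    },
                    "required": ["verdict", "reason"],
                    "additionalProperties": False,
                },
            },
        },
    )

    print(json_mode.choices[0].message.content)
    print(structured.choices[0].message.content)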


> What are you talking about?

Please edit out swipes like this from your HN comments—this is in the site guidelines: https://news.ycombinator.com/newsguidelines.html. It comes across as aggressive, and we want curious conversation here.

Your comment would be fine without that bit.


This is not meant to be snide; I'm genuinely confused about whether I might have misunderstood the problem here, because the solution would be so obvious.


I believe you! But when an internet reply leads with "what are you talking about?", it's likely to pattern-match this way for many readers. If that's not your intent, it's best to use an alternate wording.


Not to be rude, but they clarified it wasn't meant as a snide remark, so why are you trying to control speech to this degree? If we don't like his tone we can downvote him anyway and self-regulate.


They clarified that their intention was good, but intent doesn't communicate itself—it needs to be disambiguated [1]. What matters in terms of moderation is not intent, but effects, i.e. effects on the system in the general case [2].

Arguably your question reduces to: why does HN have moderators at all? The answer to that is that unfortunately, the system of community + software doesn't function well on its own over time—it falls into failure modes and humans (i.e. mods) are needed to jig it out of those [3]. I say "unfortunately" because, of course, it would be so much better if this weren't needed.

You can't assess this at the level of an individual interaction, though, because it's scoped at the whole-system level. That is, we can (and do) make bad individual calls, but what's important is how the overall system functions. If you see the mods making a mistake, you're welcome to point it out (and HN users are not shy about doing so!), and we're happy to correct it. But it doesn't follow that you don't need moderators for the system to work, or even survive.

[1] https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...

[2] https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...

[3] https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...


Yep! I have read your work! Pretty cool! I also worked on a similar deep research agent for autoformalization this summer at AWS ARChecks, building on similar patterns.

Although that work is not public, you can play with the generally available product here!

[1] https://aws.amazon.com/blogs/aws/minimize-ai-hallucinations-...


The autoformalization gap is pretty difficult to bridge indeed. We explored uncertainty quantification of autoformalization on well-defined grammars in our NeurIPS 2025 paper: https://arxiv.org/abs/2505.20047

If you ever feel like discussing more details, happy to chat!

