tompetry's comments

Is the $200 price tag there to simply steer you to the $20 plan rather than the free plan?


Now this is what I call taste.


Cool concept. Anybody know of a project with realtime spontaneous interactions like this that fits a defined use case?


On isitchristmas.com, this happens for several days surrounding Christmas. The defined use case is informing you of whether it is Christmas.


Google Docs.


I've thought about this topic a lot before. The concept of a founder vs. manager "mode" doesn't resonate with me, but the message boils down to this: founders have more freedom to approve high-risk/high-reward initiatives while shielding everybody else from the downside. It's "their company," and their job can stomach some turmoil in the name of winning.

Managers don't have this safety net; failures will be evaluated more critically and they probably aren't the types to win at all costs. So they are more focused on avoiding fatal mistakes that will lose them their jobs, but don't end up achieving as much on average.

In a sports analogy, founders play to win and managers play not to lose.

But it's a spectrum, not black and white. Every founder has to delegate, and every founder has to bring on great people to do the work and let them breathe. I think it's about freely enabling bold projects while minimizing the fear and anxiety at all levels, and figuring out the decision-making and execution structure that works best.


>> By some estimates, more than 80 percent of AI projects fail — twice the rate of failure for information technology projects that do not involve AI.

So 40% of projects built on more proven/mature technologies fail? That's super high. Replace "AI" with any other project "type" in the root causes and it sounds about right. So this feels more like a commentary on corporate "waste" in general than on AI.


AFAIK, nearly 60% of software projects fail. That means 40% don't, which is about double the 20% success rate they are reporting for AI.

That phrase you are quoting is probably a case of journalists being bad with numbers.


This was awesome


The single data point here is Adam Neumann, so I have a hard time taking this seriously.

I have raised 6 equity rounds as a founder of 2 companies. Never took a dime off the table, was never offered it, never asked for it. We actually did have early employees ask about it, and we encouraged them to not sell.

Why would you, especially at early stage valuations? You're either bad at math, or you know you're about to fail. And who is buying these secondary shares? I don't know a VC or angel who would "de-risk" an early founder like this; it's not aligned with their model. It also complicates QSBS status if I recall correctly.


> Never took a dime off the table, was never offered it, never asked for it

Well they certainly wouldn't volunteer the offer without you asking for it.


> The founder in this scenario was offered $400,000 of liquidity at Series A and $750,000 at Series B and encouraged to do so by their board of investors to de-risk their own life.

This is from the article. I would tend to agree with you.


I straight up don't believe the article. (Edit: not saying author is lying, but that they're extrapolating from bad data.) I've worked as employee #3 at one startup, co-founded another which achieved >$3bn valuation, and am now solo-founding a third. I've networked with lots of other founders. I've never, ever heard of a secondary liquidity offer in a Series A.

I think the paragraph above that quote explains it. They're talking about founders that "mortgaged their house and lived on ramen noodles for years." It actually sounds like they got screwed out of some equity. Rather than pay themselves a reasonable salary to support their lifestyle as they build the company, they instead traded equity for a one-time payment. That's a shitty deal, and I want to know who this predatory VC is so I make sure I never take money from them.


This so much. So many folks in this thread are talking about Series B+ and only paying themselves under $100k/yr, and that's just a scam. Once you have institutional money you can just start paying yourself enough to live ~comfortably.


I've worked quite a bit with STT and TTS over the past ~7 years, and this is the most impressive and even startling demo I've seen.

But I would like to see how this is integrated into applications by third party developers where the AI is doing a specific job. Is it still as impressive?

The biggest challenge I've had with building any autonomous "agents" with generic LLMs is that they are overly gullible and accommodating, forcing a return to legacy chatbot logic trees etc. to stay on task and perform a job. Also STT is rife with speaker interjections, leading to significant user frustration to the point that users just want to talk to a person. Hard to see if this is really solved yet.


I’ve found using logic trees with LLMs isn’t necessarily a problem or a deficit. I suppose if they were truly magical and could intuit the right response every time, cool, but I’d always worry about the potential for error and hallucinations.

I’ve found that you can create declarative logic trees from JSON and use that as a prompt for the LLM, which it can then use to traverse the tree accordingly. The only issue I’ve encountered is when it wants to jump to part of the tree which is invalid in the current state. For example, you want to move a user into a flow where certain input is required, but the input hasn’t been provided yet. A transition is suggested to the program by the LLM, but it’s impossible so the LLM has to be prompted that the transition is invalid and to correct itself. If it fails to transition again, a default fallback can be given but it’s not ideal at all.
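
For illustration, a rough sketch of that kind of validation loop in Python (the flow, the field names, and the stubbed model reply are all made up):

    import json

    # Hypothetical logic tree: each node lists the input it requires and the
    # states it may transition to. The JSON itself goes into the prompt.
    FLOW = {
        "collect_email": {"requires": [], "next": ["confirm_order"]},
        "confirm_order": {"requires": ["email"], "next": ["done"]},
        "done": {"requires": [], "next": []},
    }

    def build_prompt(state, collected):
        return (
            "You are guiding a user through this flow:\n"
            + json.dumps(FLOW, indent=2)
            + f"\nCurrent state: {state}\nCollected input: {json.dumps(collected)}\n"
            + 'Reply with JSON like {"transition": "<next state>"}'
        )

    def transition_is_valid(state, target, collected):
        # Reject jumps the tree doesn't allow, or ones whose required input is missing.
        node = FLOW.get(target)
        if node is None or target not in FLOW[state]["next"]:
            return False
        return all(field in collected for field in node["requires"])

    # The reply would come from whatever LLM call you already use; stubbed here.
    reply = json.loads('{"transition": "confirm_order"}')
    if not transition_is_valid("collect_email", reply["transition"], collected={}):
        # Tell the model the transition is invalid and let it correct itself,
        # or fall back to a default if it fails again.
        print("invalid transition:", reply["transition"])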

However, another nice aspect of having the tree declared in advance is that it shows human beings what the system is capable of and how it's intended to be used. This has proven to be pretty useful, as letting the LLM call whatever functions it sees fit based on broad intentions and system capabilities leaves humans in the dark a bit.

So, I like the structure and dependability. Maybe one day we can depend on LLM magic and not worry about a team understanding the ins and outs of what should or shouldn’t be possible, but we don’t seem to be there yet at all. That could be in part because my prompts were bad, though.


Any recommendations on patterns/approaches for these declarative logic trees, and where to put which types of logic (logic that goes in the prompt, logic that goes in the code which parses the prompt response, how to detect errors in the response and retry the prompt, etc.)? On "Show HN" I see a lot of "fully automated agents" which seem interesting, but I'm not sure if they are overkill or not.


Personally, I've found that a nested class structure with instructions in annotated field descriptions and/or docstrings can work wonders. Especially if you handle your own serialization to JSON Schema (either by rolling your own or using hooks provided by libraries like Pydantic), so you can control what attributes get included and when.
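
To make that concrete, here's a minimal sketch assuming Pydantic v2 (the Order/Address models and their fields are invented for illustration):

    from pydantic import BaseModel, Field

    class Address(BaseModel):
        """Mailing address collected from the user."""
        city: str = Field(description="City name, spelled out in full")
        zip_code: str = Field(description="5-digit US ZIP code")

    class Order(BaseModel):
        """Top-level object the model is asked to return."""
        customer: str = Field(description="Customer's full name")
        address: Address

    # The generated schema carries the docstrings and field descriptions, so the
    # instructions travel with the structure when it's pasted into a prompt.
    print(Order.model_json_schema())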


The JSON serialization strategy worked really well for me in a similar context. It was kind of a shot in the dark but GPT is pretty awesome at using structured data as a prompt.


I actually only used an XState state machine with JSON configuration and used that data as part of the prompt. It worked surprisingly well.

Since it has an okay grasp on how finite state machines and XState work, it seems to do a good job of navigating the tree properly and reliably. It essentially does so by outputting information it thinks the state machine should use as a transition in a JSON object which gets parsed and passed to a transition function. This would fail occasionally so there was a recursive “what’s wrong with this JSON?” prompt to get it to fix its own malformed JSON, haha. That was meant to be a temporary hack but it worked well, so it stayed. There were a few similar tools for trying to correct errors. That might be one of the strangest developments in programming for me… Deploying non-deterministic logic to fix itself in production. It feels wrong, but it works remarkably well. You just need sane fallbacks and recovery tactics.
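
A generic sketch of that repair loop, in Python here since the idea is language-agnostic (ask_llm is just a stand-in for whatever model call is already wired up):

    import json

    def repair_json(raw, ask_llm, max_attempts=3):
        """Parse the model's JSON reply; on failure, feed the error back and retry."""
        for _ in range(max_attempts):
            try:
                return json.loads(raw)
            except json.JSONDecodeError as err:
                # Ask the model to fix its own malformed output, including the
                # parser's complaint so it knows what went wrong.
                raw = ask_llm(
                    f"This JSON failed to parse ({err}). "
                    f"Return only the corrected JSON:\n{raw}"
                )
        return None  # caller falls back to a safe default transition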

It was a proprietary project so I can’t share the source, but I think reading up on XState JSON configuration might explain most of it. You can describe most of your machine in a serializable format.

You can actually store a lot of useful data in state names, context, meta, and effect/action names to aid with the prompting and weaving state flows together in a language-friendly way. I also liked that the prompt would be updated by information that went along with the source code, so a deployment would reliably carry the correct information.

The LLM essentially hid a decision tree from the user and smoothed over the experience of navigating it through adaptive and hopefully intuitive language. I’d personally prefer to provide more deterministic flows that users can engage with on their own, but one really handy feature of this was the ability to jump out of child states into parent states without needing to say, list links to these options in the UI. The LLM was good at knowing when to jump from leaves of the tree back up to relevant branches. That’s not always an easy UI problem to solve without an AI to handle it for you.

edit: Something I forgot to add is that the client wanted to be able to modify these trees themselves, so the whole machine configuration was generated by a graph in a database that could be edited. That part was powered by Strapi. There was structured data in there and you could define a state, list which transitions it can make, which actions should be triggered and when, etc. The client did the editing directly in Strapi with no special UI on top.

Their objective is surveying people in a more engaging and personable way. They really wanted surveys which adapt to users rather than piping people through static flows or exposing them to redundant or irrelevant questions. Initially this was done with XState and no LLM (it required some non-ideal UI and configuration under the hood to make those jumps to parent states I mentioned, but it worked), and I can't say how effective it is but they really like it. The AI hype was very very strong on that team.


I'm building a whole AI agent-building platform on top of XState actors. Check it out: craftgen.ai or https://github.com/craftgen/craftgen


LangGraph


>Also STT is rife with speaker interjections, leading to significant user frustration to the point that users just want to talk to a person. Hard to see if this is really solved yet.

This is not using TTS or STT. Audio and image data can be tokenized as readily as text. This is simply an LLM that happens to have been trained to receive and spit out audio and image tokens as well as text tokens. Interjections are a lot more palatable in this paradigm, as most of the demos show.


Adding audio data as a token, in and of itself, would dramatically increase training size, cost, and time for very little benefit. Neural networks also generally tend to function less effectively with highly correlated inputs, which I can only assume is still an issue for LLMs. And adding combined audio training would introduce rather large scale correlations in the inputs.

I would wager like 100:1 that this is just introducing some TTS/STT layers. The video processing layer is probably doing something similar: taking an extremely limited number of 'screenshots', carrying out typical image captioning using another layer, and then feeding that as an input. So the demo, to me, seems most likely to just be 3 separate 'plugins' operating in unison - text to speech, speech to text, and image to text.

The interjections are likely just the software being programmed to aggressively begin output following any lull after an input pattern. Note that in basically all the videos, the speakers have to repeatedly cut off the LLM as it starts speaking in conversationally inappropriate places. In the main video, which is just an extremely superficial interaction, the speaker made sure to be constantly speaking when interacting, only pausing once to take a breath that I noticed. He also struggled with the timing of his own responses, as the LLM still seems to be attached to its typical, and frequently inappropriate, rambling verbosity (though perhaps I'm not one to critique that).


>I would wager like 100:1 that this is just introducing some TTS/STT layers.

Literally the first paragraph of the linked blog.

"GPT-4o (“o” for “omni”) is a step towards much more natural human-computer interaction—it accepts as input any combination of text, audio, and image and generates any combination of text, audio, and image outputs."

Then

"Prior to GPT-4o, you could use Voice Mode to talk to ChatGPT with latencies of 2.8 seconds (GPT-3.5) and 5.4 seconds (GPT-4) on average. To achieve this, Voice Mode is a pipeline of three separate models: one simple model transcribes audio to text, GPT-3.5 or GPT-4 takes in text and outputs text, and a third simple model converts that text back to audio. This process means that the main source of intelligence, GPT-4, loses a lot of information—it can’t directly observe tone, multiple speakers, or background noises, and it can’t output laughter, singing, or express emotion.

With GPT-4o, we trained a single new model end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network."


I can't square this with the speed. A couple of layers doing STT are technically still part of the neural network, no? Because the increase in token base to cover multimodal tokenization would make even text inference slower than 4-turbo, not twice as fast.

But I’m not an expert!


OpenAI gives so little information on the details of their models now that one can only speculate how they've managed to cut down inference costs.

STT throws away a lot of information that is clearly being preserved in a lot of these demos, so that's definitely not happening here in that sense. That said, the tokens would be merged into a shared embedding space. Hard to say how they are approaching it exactly.


I'd mentally change the acronym to Speech to Tokens. Parsing emotion and other non-explicit indicators in speech has been an ongoing area of research for years now. Metadata on speaker identity, inflection, etc. could easily be added, and current LLMs already work with it just fine. For instance, asking Claude, with 0 context, to parse the meaning of "*laughter* Yeah, I'm sure that's right." instantly yields:

----

The phrase "*laughter* Yeah, I'm sure that's right" appears to be expressing sarcasm or skepticism about whatever was previously said or suggested. Here's a breakdown of its likely meaning:

"*laughter*" - This typically indicates the speaker is laughing, which can signal amusement, but in this context suggests they find whatever was said humorous in an ironic or disbelieving way.

"Yeah," - This interjection sets up the sarcastic tone. It can mean "yes" literally, but here seems to be used facetiously.

"I'm sure that's right." - This statement directly contradicts and casts doubt on whatever was previously stated. The sarcastic laughter coupled with "I'm sure that's right" implies the speaker believes the opposite of what was said is actually true.

So in summary, by laughing and then sarcastically saying "Yeah, I'm sure that's right," the speaker is expressing skepticism, disbelief or finding humor in whatever claim or suggestion was previously made. It's a sarcastic way of implying "I highly doubt that's accurate or true."

----


It could be added. It still wouldn't sound as good as what we have here. Audio is audio and text is text, and no amount of metadata we can practically provide will replace the information present in sound.

You can't exactly metadata your way out of this (skip to 11:50)

https://www.youtube.com/live/DQacCB9tDaw?si=yN7al6N3C7vCemhL


I'm not sure why you say so? To me that seems obviously literally just swapping/weighting between a set of predefined voices. I'm sure you've played a game with a face generator - it's the exact same thing, except with audio. I'd also observe in the demo that they explicitly avoided anything particularly creative, instead sticking within an extremely narrow domain of very basic adjectives: neutral, dramatic, singing, robotic, etc. I'm sure it also has happy, sad, angry, mad, and so on available.

But if the system can create a flamboyantly homosexual Captain Picard with a lisp and slight stutter engaging in overt innuendo when stating, "Number one, Engage!" then I look forward to eating crow! But as the instructions were all conspicuously just "swap to pretrained voice [x,y,z]", I suspect crow will not be on the menu any time soon.



What about the input of the heavy breathing?


I'm sorry, but you don't know what you're talking about and I'm done here. Clearly you've never worked with or tried to train STT or TTS models in any real capacity, so inventing dramatic capabilities and disregarding latency and data requirements must come easily to you.

OpenAI has explicitly made this clear. You are wrong. There's nothing else left to say here.


Since OpenAI has gone completely closed, they've been increasingly opaque and dodgy about how even things like basic chat work. Assuming the various leaked details of GPT-4 [1] are correct (and to my knowledge there has been no indication that they are not), they have been actively misleading and deceptive - as even the 'basic' GPT-4 is a mixture-of-experts system, and not one behemoth neural network.

[1] - https://lifearchitect.ai/gpt-4/


A Mixture of Experts model is still one behemoth neural network, and believing otherwise is just a common misconception about the term.

MoE models are attempts at sparsity, only activating a set number of neurons/weights at a time. They're not separate models stitched together. They're not an ensemble. I blame the name at this point.


I would ask you to watch the demo on SoundHound.com. It does less, yes, but it's so crucially fit for use. You'll notice from the GPT-4 demo shown that they were guiding the LLM into a chain of reasoning. It works very well when you know how to work it, which aligns with what you're saying. I don't mean to degrade the achievement, it's great, but we often inflate expectations of what something can actually do before reaching real productivity.


I think if you listen to the way it answers, it seems it's using a technique trained speakers use: to buy itself time to think, it repeats/paraphrases the question/request before actually answering.

I'm sure you'll find this part is a lot quicker to process, giving the instant response (the old GPT-4-turbo is generally very quick with simple requests like this). Rather impressively, all it would need is an additional custom instruction.

Very clever and eerily human.


This behavior is clearly shown on the dad joke demo: https://vimeo.com/945587876


Have you seen this video from Microsoft, uploaded to YT in 2012? The actual video could be even older: https://www.youtube.com/watch?v=Nu-nlQqFCKg


While I'm not convinced it's VC vs. bootstrapped that should differentiate the frameworks, you should read this book; it is simply the closest thing to a step-by-step user guide to finding PMF that I've seen, and it worked for me: https://www.amazon.com/Four-Steps-Epiphany-Steve-Blank/dp/09...


I have the same concerns generally. But one non-evil use popped into my head...

My dad passed away a few months ago. Going through his things, I found all of his old papers and writings; they have great meaning to me. It would be so cool to have them as audio files, my dad as the narrator. And for shits, try it with a British accent.

This may not abate the concerns, but I'm sure good things will come too.


Serious question: is this a healthy way to treat ancestors? In the future will we just keep grandma around as an AI version of her middle aged self when she passes?


Fair question. People have kept pictures, paintings, art, belongings, etc. of their family members for countless generations. AI will surely be used to create new ways to remember loved ones. I think that is very different from "keeping around grandma as an AI version of herself" and pretending she is still alive, which I agree feels unhealthy.


Made me think about how different generations can be, and what counts as "countless." I can count back 2 and have not even seen anything about my grandfather, who was born 101 years before me (1875). At least I have a "signature", written as XXX in the 1860s, on a plan of farmland from his father, from when he bought it out of slavery. And that actual farmland. Good luck AI-ing that heritage.


I think everyone's entitled to their opinion here. As for me, though: my brother died at 10 years old (back in the 90s). While there are some home videos with him talking, it's never for more than a few seconds at a time.

Maybe a decade ago, I came across a cassette tape that he had used to record himself reading from a book for school - several minutes in duration.

It was incredibly surprising to me how much he sounded like my older brother. It was a very emotional experience, but personally, I can't imagine using that recording to bootstrap a model whereby I could produce more of his "voice".


There's a Black Mirror episode about something like that, though I don't remember the details.


I remember a journalist actually doing it, but just the AI part of course, not the robot.


Yup, "Be Right Back", S2E1

And possibly another one, but that would be a spoiler


It seems unhealthy for us to sort out what is and is not a healthy way for someone else to mourn, or to remember, their own grandmother.

It is healthier for us to just let others do as they wish with their time without passing judgement.


It worked for Superman; he seemed well adjusted after talking to his dead parents.


Not sure if this is related to this tech, but I think it is worthwhile: The Beatles - Now And Then - The Last Beatles Song (Short Film)

https://www.youtube.com/watch?v=APJAQoSCwuA

