Hacker News | 0xjunhao's comments

Hi, I'm the author of this post. Writing it was a great learning experience. I gained a lot of insight into vLLM. If you have any feedback or questions, feel free to drop a comment below!


In your forward pass section you give a lot of emphasis to FlashAttention, but it might be worth mentioning Paged Attention as well (which was the paper written by the vLLM authors and I believe was the genesis of the project). PA-style block tables are now supported in most fused attention kernels, but vLLM originally came up with it and it's the main reason why vLLM has such high throughput!


Thank you! We have incorporated your suggestion.


Thanks for writing the article!

I didn't quite get this part:

"Note that during the prefill phase, all prompt tokens from a request can be processed in one batch. This is possible because the query (Q) tensors, calculated from the tokens immediately before them, are available for each prompt token position."

I know that in practice prefill is much faster than inference. Would watching the 2h video from Karpathy help me understand why?


That snippet is trying to say that you can calculate KV for all the input tokens at once, and you don't need to loop over them since you have them all available.

For decode, by contrast, you need to generate each token sequentially.
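To make the difference concrete, here is a toy sketch. It is not how vLLM implements either phase: `toy_kv` and `toy_next_token` are hypothetical stand-ins for the real projections and sampling step. The point is only that prefill can build the cache for every prompt token in one pass, while each decode step has to wait for the previous token.

```python
def toy_kv(token: int) -> tuple[int, int]:
    """Hypothetical stand-in for computing one token's key and value."""
    return (token * 2, token * 3)


def prefill(prompt: list[int]) -> list[tuple[int, int]]:
    # Every prompt token is known up front, so K/V for all positions
    # can be computed together (one batched pass on a real GPU).
    return [toy_kv(t) for t in prompt]


def toy_next_token(kv_cache: list[tuple[int, int]]) -> int:
    """Hypothetical stand-in for attention + sampling over the whole cache."""
    return sum(k + v for k, v in kv_cache) % 7


def decode(kv_cache: list[tuple[int, int]], steps: int) -> list[int]:
    generated = []
    for _ in range(steps):
        # The next token depends on everything generated so far,
        # so this loop cannot be collapsed into a single pass.
        token = toy_next_token(kv_cache)
        generated.append(token)
        kv_cache.append(toy_kv(token))
    return generated


cache = prefill([1, 2, 3])   # one pass over the whole prompt
output = decode(cache, 3)    # three strictly sequential steps
```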


And on the topic of prefill: do you know what the role of GPUs is in prefill vs. in inference?


Prefill is part of inference. It's the first major step, where you calculate all the keys and values for the input tokens.

Decode is the next major step where you start generating output tokens one at a time.

Both run on GPUs but have slightly different workloads:

1. Prefill does relatively little I/O between HBM (the GPU's VRAM) and on-chip memory, and is dominated by compute.

2. Decode is light on compute but has to read the keys and values computed in the prefill stage for every output token.


Doesn't decode also need to stream in the whole of the model weights, thus very I/O heavy?


Yes, decoding is very I/O heavy. It has to stream in the whole of the model weights from HBM for every token decoded. However, that cost can be shared between the requests in the same batch. So if the system has more GPU RAM to hold larger batches, the I/O cost per request can be lowered.
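As a back-of-the-envelope sketch of that sharing (the numbers are assumptions for illustration, not measurements): if one decode step streams the full weights from HBM once for the whole batch, the weight traffic attributable to each request shrinks linearly with batch size.

```python
def weight_bytes_per_request(weight_bytes: int, batch_size: int) -> float:
    """Weight bytes read from HBM per decode step, divided across the batch.

    Assumes the weights are streamed once per step regardless of batch size.
    Per-request KV-cache reads are NOT included; those do not amortize.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return weight_bytes / batch_size


# Assumed example: roughly 14 GB of weights (e.g. 7B parameters at 2 bytes each).
WEIGHTS = 14_000_000_000
solo = weight_bytes_per_request(WEIGHTS, 1)      # 14 GB per request per step
batched = weight_bytes_per_request(WEIGHTS, 32)  # about 0.44 GB per request per step
```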


Great write-up! It would be interesting to see many of those features compared against other frameworks.


Thanks for this! Learnt a lot.

Curious how we ensure that the same model instance gets requests from the same client/user, since conversations are stateful and the model needs context from previous turns of the conversation.

Is this happening at the load balancer layer?


It's either sticky sessions or a load balancer that keeps track of prior sequences and routes to the instance with the largest match. https://docs.sglang.ai/router/router.html


They’re not stateful; you submit the entire history with every call. Prompt caching and the like make it important for performance to have sticky sessions or something similar at the load balancer layer.


Yes, typically users send the newest user message and the full conversation history. These combined become the prompt.

Our API endpoint will try to route requests that have the same prefix to the same vLLM instance (similar to longest prefix matching in networking), and hopefully there are still some KV caches for part of the prompt there.
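A minimal sketch of the longest-prefix idea, not our actual routing code: the instance names, the `cached` table of recently served token sequences, and both function names are hypothetical. The router picks whichever instance shares the longest token prefix with the incoming prompt.

```python
def common_prefix_len(a: list[int], b: list[int]) -> int:
    """Number of leading tokens that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pick_instance(prompt: list[int], cached: dict[str, list[list[int]]]) -> tuple:
    """Return (instance_name, matched_length) for the best prefix match.

    `cached` maps each instance name to token sequences it served recently
    (a hypothetical router-side record, not the real KV cache).
    """
    best_name, best_len = None, -1
    for name in sorted(cached):  # sorted so ties break deterministically
        longest = max(
            (common_prefix_len(prompt, seq) for seq in cached[name]),
            default=0,
        )
        if longest > best_len:
            best_name, best_len = name, longest
    return best_name, best_len


cached = {"a": [[1, 2, 3]], "b": [[1, 2, 3, 4, 5], [9]]}
target = pick_instance([1, 2, 3, 4, 7], cached)  # "b" matches 4 tokens, "a" only 3
```

A match here is only a hint: the instance may already have evicted the corresponding KV blocks, which is why the cache hit is "hopefully" rather than guaranteed.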


FYI, a literature review from a deep research AI:

The effect of noise on sleep is multifaceted, involving various types of noise exposure, physiological mechanisms, and consequences on sleep quality and health.

I. Introduction

Environmental and occupational noise refers to unwanted or harmful sounds from sources such as traffic, aircraft, railways, and workplaces. Noise pollution is a significant public health issue due to its widespread prevalence and impact on sleep and overall health.

II. Types of Noise Exposure

Environmental noise includes road traffic, aircraft, and railway noise. Aircraft noise, for example, has been shown to negatively affect sleep in children, whereas road traffic noise may have less impact on autonomic activity during sleep [Effect of Noise on Sleep and Autonomic Activity in Children]. Occupational noise exposure also contributes to sleep disturbances and has been linked to sleep loss and fragmentation [The effects of occupational noise on sleep]. Intentional masking noise such as white or pink noise is sometimes used as a sleep aid to improve sleep quality in noisy environments [The effects of white noise on sleep].

III. Physiological Mechanisms

During sleep, the auditory system continues to process sounds, and noise can lower arousal thresholds, causing awakenings or micro-arousals. Noise exposure activates the autonomic nervous system, increasing heart rate and blood pressure, and stimulates the hypothalamic-pituitary-adrenal (HPA) axis, leading to hormonal and metabolic changes [Environmental noise and sleep disturbances].

IV. Effects on Sleep Architecture

Noise exposure reduces total sleep time and sleep efficiency, increases sleep fragmentation, and causes more frequent micro-arousals. It also alters sleep stages by decreasing deep sleep (slow-wave sleep) and REM sleep, which are critical for restorative sleep [Effects of environmental noise on sleep].

V. Health and Daytime Consequences

Poor sleep due to noise leads to daytime sleepiness, cognitive impairments, and mood disturbances. Chronic noise-related sleep disruption is associated with increased cardiovascular and metabolic risks, such as hypertension and glucose metabolism disturbances. It also negatively affects quality of life and can contribute to burnout [Environmental Noise and Effects on Sleep], [Effects of personal noise exposure, sleep quality, and burnout].

VI. Vulnerable and Special Populations

Children are particularly sensitive to noise, with aircraft noise shown to disrupt their sleep more than road traffic noise. Older adults and individuals with pre-existing sleep disorders are also more vulnerable to noise-induced sleep disturbances [Effect of Noise on Sleep and Autonomic Activity in Children].

VII. Noise as a Sleep Aid

Paradoxically, steady background noise such as white or pink noise can improve sleep quality by masking disruptive environmental sounds. Studies have shown that white noise can significantly enhance subjective and objective sleep measures in noisy urban settings. However, potential downsides include hearing damage and psychological dependency on noise for sleep [Noise as a sleep aid: A systematic review], [The effects of white noise on sleep].

VIII. Mitigation and Future Directions

Effective strategies to mitigate noise effects on sleep include engineering controls like soundproofing, use of earplugs, and regulatory noise limits. Behavioral interventions and public health policies are also important. Future research is needed to clarify long-term effects, dose-response relationships, and to develop personalized interventions [The Effect of Room Acoustics on the Sleep Quality of Healthy Sleepers].

In summary, noise negatively impacts sleep by disrupting sleep architecture and triggering physiological stress responses, leading to adverse health outcomes. While some forms of noise can aid sleep by masking disturbances, overall noise reduction remains critical for improving sleep quality and health.

References

Environmental noise and sleep disturbances: A threat to health? Demian Halperin et al. Sleep Science, 2014 Nov 15. https://pmc.ncbi.nlm.nih.gov/articles/PMC4608916/ Poor sleep causes measurable changes on these systems. Experimental studies demonstrated that both sleep restriction and poor quality sleep affect glucose ...

Effects of environmental noise on sleep Kenneth I Hume et al. Noise & health, 2012. https://pubmed.ncbi.nlm.nih.gov/23257581/ This paper summarizes the findings from the past 3 year's research on the effects of environmental noise on sleep and identifies key future research goals.

The Effect of Room Acoustics on the Sleep Quality of Healthy Sleepers Ingo Fietze et al. Noise & Health, 2016 Sep-Oct. https://pmc.ncbi.nlm.nih.gov/articles/PMC5187651/ Noise is one of the factors that can seriously disturb sleep, and sound volume is an important factor in this context. One strategy involves avoiding ...

Noise as a sleep aid: A systematic review Samantha M. Riedy et al. Sleep Medicine Reviews, 2021/02/01. https://www.sciencedirect.com/science/article/abs/pii/S10870... ... sleep aid, especially since it may also negatively affect sleep and hearing. ... Suzuki et al. Sleep deepening effect of steady pink noise. J Sound Vib. (1991).

Environmental Noise and Effects on Sleep: An Update to the WHO Systematic Review and Meta-Analysis Michael G Smith et al. Environmental Health Perspectives, 2022 Jul 11. https://pmc.ncbi.nlm.nih.gov/articles/PMC9272916/ Jul 11, 2022 ... To what extent have the following outcomes of railway noise occurred in the past 12 months? Railway noise disturbs when falling asleep. Not ...

Effects of personal noise exposure, sleep quality, and burnout on quality of life: An online participation cohort study in Taiwan Ta-Chien Chan et al. Science of The Total Environment, 2024/03/10. https://www.sciencedirect.com/science/article/pii/S004896972... Mar 10, 2024 ... To our knowledge, this is the first study to explore the pathways through which daily noise exposure, sleep quality, and personal burnout affect ...

Effect of Noise on Sleep and Autonomic Activity in Children according to Source https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8476937/ Road traffic noise did not significantly alter sleep or autonomic activity in children, whereas aircraft noise exerted a negative effect on sleep.

The effects of occupational noise on sleep: A systematic review Saeid Yazdanirad et al. Sleep Medicine Reviews, 2023/12/01. https://www.sciencedirect.com/science/article/abs/pii/S10870... One of the important effects due to noise exposure is sleep loss/disturbance, which has received less attention. Sleep disturbance is defined as problems with ...

The effects of white noise on sleep and duration in individuals living in a high noise environment in New York City Matthew R Ebben et al. Sleep medicine, 2021/7. https://pubmed.ncbi.nlm.nih.gov/34049045/ Our data show that white noise significantly improved sleep based on subjective and objective measurements in subjects complaining of difficulty sleeping.


IMHO the most appropriate medical decision should depend on the patient's economic situation.


I remember trying to publish something of a similar style to arxiv and getting rejected. Seems that having an abstract and references is key :)


If you have ever lost someone due to other drivers not watching the road, you will start to appreciate that these cars are at least always watching the road and won’t kill people standing right in front of them. Nevertheless, a lot of improvements still need to be made, and I hope they do them quickly and safely.


What's the difference between resigning and being laid off? :)


Resigning is employee-initiated; being laid off is company-initiated. Not sure about other states, but in Illinois if you resign, most likely you won't qualify for unemployment benefits. If you're laid off (as I've been multiple times), you can qualify for unemployment benefits. I'm basing this on my experience, but you should consult an attorney before signing anything if you're being fired/laid off/let go, etc.


In Germany, it’s similar: you’re entitled to unemployment benefits regardless (if you’ve been an employee of a company that pays for your salary and taxes), but if you quit you don’t get benefits for 3 months as a “penalty”, so people ask to get fired by their employer and in some cases the employer obliges.


That and severance pay if the company does it. From the article: "if they chose [to resign], they would not receive severance pay."


Hi, I had a quick question. Would it be correct to say the following?

1. For long inputs and short outputs, the inference can be arbitrarily many times faster, as it avoids repeated KV computation.

2. Conversely, for short inputs and long outputs, it might be slightly slower, since loading and storing the KV cache are on the critical path of the execution.


That is almost true for both. For the second case, though, you can simply skip storing the KV cache when there is little improvement to be had.
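A rough cost model makes both cases easier to see. This is a simplification under stated assumptions, not a measured result: it assumes a full cache hit replaces prefill with a cache load, and all timings below are made-up illustrative values.

```python
def speedup(prefill_s: float, decode_s: float,
            load_s: float, store_s: float = 0.0) -> float:
    """Ratio of baseline latency to cached latency (values > 1 mean faster).

    Assumes a full cache hit: prefill is skipped and replaced by loading
    the KV cache; storing newly produced KV is optional overhead.
    """
    baseline = prefill_s + decode_s
    cached = load_s + decode_s + store_s
    return baseline / cached


# Case 1: long input, short output. Prefill dominates, so the gain is large
# and keeps growing as the prompt gets longer.
long_in_short_out = speedup(prefill_s=10.0, decode_s=0.5, load_s=0.5)

# Case 2: short input, long output. Load and store overhead can outweigh the
# small prefill saving; setting store_s to 0 models skipping the store.
short_in_long_out = speedup(prefill_s=0.1, decode_s=20.0, load_s=0.05, store_s=0.1)
```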


Before I became a software engineer, I was a computational physicist. My days back then were pretty much tweaking some parameters, running a job, then reading papers and checking back after a few minutes or hours. Increasingly, I’m starting to think my days as a software engineer will be pretty similar.


With the rise of Agentic AI, this increasingly feels like the right move, unless AWS drastically lowers their prices.


Agreed, LLMs helped us with this.


In a world of LLMs, it's great to see classic NLP works like Harper. Both definitely have their own use cases.

