It's xenophobia. Whenever anything Chinese is brought up, someone has to bring up something unrelated to show how horrible China is.
Same thing happens with pollution per capita or any number of environmental metrics. People jump through insane mental hoops to paint China as far worse than the USA.
Given the current US administration I don't see any respect. America wants its tech and offerings to be a shining beacon while my continent, Europe, and others have to be painted in a worse light.
There was zero chance the company with the best LLMs would not get involved in weaponization, willingly or not.
DEI is going to be one of the first things the next administration brings back, mark my words. Saying it's discriminatory is like saying affirmative action is discriminatory.
Affirmative action is discriminatory. It is literally discriminating between different people based on sex, race, religion, or gender identity. You can argue it's good discrimination, but it is by definition discrimination.
But its absence demonstrably leads to discrimination based on the same things through subconscious bias. At least with it in place you get a fairer spread of people being affected.
If you want to discriminate for the greater good, you elevate yourself above others. The only thing you could do is to take a step back yourself instead of forcing others to do so.
If you cannot restrict yourself to that, you are as bad as those that discriminate without further reflection.
If you don't "discriminate for the greater good" (your phrasing, not mine) and you are part of the dominant group who benefits from not doing so then you also elevate yourself above others. That is not a morally superior choice.
Personally I do take that step back. I would consider myself a violent person if I did not. I look around and I see my peers choosing not to work towards dismantling the systems that privilege themselves, and I consider their inaction a form of violence against people who are already struggling. I am not American so I can't speak for the culture there, but in my society there is a general acceptance of using collective power to limit the violence of individuals, and even more so when there is such an evident power disparity.
No, it would not be morally superior and it is commendable if you do take a step back. But to remove discrimination you have to leave others to make their choice too.
> I consider their inaction a form of violence against people who are already struggling.
To help people in need, you do not have to put them in groups first. This grouping is already discrimination. I would say literally, but the term is a bit overused. But the good thing is that you never need discrimination to alleviate injustices.
Any argument for social security policies, and everything that remotely fits under that umbrella, does not need discrimination. In fact it would work much better without it, or without using discrimination as a justification; that even undermines the argument for it.
> To help people in need, you do not have to put them in groups first.
I don't "put" them in groups. I recognise that we are all put into groups by the systemic nature of society. I also recognise the active grouping of others who feel that they are disadvantaged and try to listen to various advocacy groups who again and again ask for me to "use my privilege" to help. The idea of choosing not to do that but rather to respond with "your sense of Otherness is constructed, just let it go and don't worry about this group or that group" would feel both cruel and absurd.
> Any argument for social security policies and everything that remotely fits under that umbrella does not need discrimination.
Again you're the one framing this as discrimination. Something I would advocate for in my workplace is bias training, where people can learn to spot unconscious biases in themselves and can learn about the effects of systemic inequality on the individual and societal levels. I think it would be difficult to argue that such actions are "discriminatory."
I understand that the conversation is about specific DEI legislation and I have to admit that I am not American and am not familiar with the minutiae of what was enacted there. But I am specifically responding to the idea that any action taken to combat implicit discrimination will lead to explicit discrimination.
There's the example of the New York Philharmonic Orchestra which I always think back on. They opened up to allow women to join, but after some years no woman had managed to get a position on merit. Eventually a blind audition was introduced and the split quickly became reflective of broader societal demographics. To my mind it feels like many people here are jumping to say that blind auditions are discriminatory because they in some way disadvantaged the white men who "looked like" professional musicians. In some way that is true, but it feels like such a warped reading of reality to me that I struggle to advocate for my own views in the face of it.
One simple thing we do at my company is to scrub things like names from applications, so that John Smith doesn't get preference over Aphiwe Mvala because of the ease/familiarity of the name. Would I advocate for mandating that practice in public jobs? Most certainly; I don't see why not, especially when I see the outright discrimination scandals such as [0] that can result from overly homogeneous institutions. Is the argument really that John is discriminated against with this policy?
A recent Gallup poll has reported that "...Support for a more moderate Democratic Party among Democrats and Democratic-leaning independents has grown by 11 percentage points, to 45%, since 2021..." so the Democrats seem to be headed in that direction.
DEI got an extreme level of support from NGOs even while Trump was president, support that they won't have at the next election. I hope they can renew themselves.
The good news on the left side, though, is that I don't think the next candidate could be worse than Kamala Harris.
My guess is any new administration would realize it's a massive political loser.
I'm personally in favor of some limited AA in unis, but if even black people are majority against AA when polled… it's definitely not coming back. It might be that this changes once minority groups realize how much of an impact it was actually having; I think a large segment of people convinced themselves that a lot more of these admissions were by pure merit than is supported by the data.
Tampermonkey scripts with ChatGPT are even faster. I add functionality to a website just by pasting the site's HTML into ChatGPT, and in two minutes I get what I need.
Making a simple tool for a site or two is the perfect use case for a userscript manager like TamperMonkey/ViolentMonkey (FOSS alternative); I think making your own extension is somewhat overkill.
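For anyone who hasn't tried it, a userscript is just a JS file with a metadata header. A minimal sketch (the @match URL and the button behaviour are made-up placeholders, not any specific site):

    // ==UserScript==
    // @name     Copy page title
    // @match    https://example.com/*
    // @grant    none
    // ==/UserScript==

    (function () {
      "use strict";
      // Hypothetical tweak: float a button that copies the page title.
      const btn = document.createElement("button");
      btn.textContent = "Copy title";
      btn.style.cssText = "position:fixed;bottom:1em;right:1em;z-index:9999";
      btn.addEventListener("click", () => {
        navigator.clipboard.writeText(document.title);
      });
      document.body.appendChild(btn);
    })();

Paste something like that into the manager's editor, adjust the @match pattern to the site, and it runs on every page load.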
There are many examples all over the world. Plenty of countries have people that have attachments to apps like WhatsApp, Telegram, etc.
All of my friends in Brazil, for instance, are entirely attached to WhatsApp.
I agree with the other reply. Where do you live? I'm surprised that you haven't encountered any. Every city in the US that I visited, and every city that I visited internationally, has people that attach themselves to one of WhatsApp, Telegram, iMessage, etc.
(I'm kind of excluding WeChat here because WeChat seems to be forced, but there are plenty of people who have this attachment to WeChat as well because of its status as an omni app.)
I think it's not as prominent as this in the US - probably because the equivalent is Apple messaging systems, if your family / friend group is all in that?
Yes, the US is one of the rare places where WhatsApp is not omnipresent (and it is the only place where RCS has any meaningful chance of ever being used).
I don't think this can be prevented with a schema. All someone has to do is legally rename themselves to "Driving license" to be the edge case in this check. Teach cops to look for the (almost) international driver license format, where your names are preceded by the numbers 1 and 2 on the license.
Not yet, but I imagine soon they will. Closed source is moving to video and open source is catching up to static images at an incredible pace. I won't be surprised if not only GIMP integrates something like a couple of general Stable Diffusion models, but pirated copies of Photoshop also find a way to hook up a local generative model instead of the online stuff.
As is the case with this exact company
"
Fight Corporations
Beat Bureaucracy
Find Hidden Money
"
This is exactly and entirely the thing an exploitative company would say it does.
Can someone explain exactly what the "unknown" of neural networks is? We built them; we know what they consist of and how they work. Yes, we can't map out every single connection between nodes in this "multilayer perceptron", but don't we know how these connections are formed?
SOTA LLMs like GPT-4o can natively understand b64-encoded text. We have algorithms that can decode and encode b64 text. Is that what GPT-4o is doing? Did training learn that algorithm? Clearly not, or at least not completely, because typos in b64 that would destroy any chance of our algorithms extracting meaning from the original text are barely an inconvenience for 4o.
So how is it decoding b64 then? We have no idea.
We don't build neural networks. Not really. We build architectures and then train them. Whatever they learn is outside the scope of human action beyond supplying the training data.
What they learn is largely unknown beyond trivial toy examples.
We know connections form, we can see the weights, we can even see the matrices multiplying. We don't know what any of those calculations are doing. We don't know what they mean.
Would an alien understand C code just because it could see it executing?
Our DNA didn't build our brain. Not really. Our DNA coded for a loose trainable architecture with a lot of features that result from emergent design, constraints of congenital development, et cetera. Even if you include our full exome and a bunch of environmental factors in your simulation, and are examining a human with obscenely detailed tools at autopsy, you're never going to be able to tell me with any certainty whether a given subject possesses the skill 'skateboarding'.
I find this analogy kind of confusing. Wouldn’t the analogous thing be to say that our DNA doesn’t understand, uh, how we are able to skateboard? But we generally don’t regard DNA as understanding anything, so that’s not unexpected.
Where does “we can’t tell whether a person possesses the skill of ‘skateboarding’” fit in with DNA not encoding anything specific to skateboarding? It isn’t as if we designed our genome, such that if our genome did hard-code skateboarding skill we would (as its designers) have a full understanding of how skateboarding skill works at the neuron level.
I recognize that a metaphor/analogy/whatever does not have to extend to all parts of something, and indeed most metaphors/analogies/whatever fail at some point if pushed too far. But, I don’t understand how the commonalities you are pointing to between [NN architecture : full NN network with the specific weights] and [human genome : the whole behavior of a person’s brain including all the facts, behaviors, etc. that they’ve learned throughout their life] is supposed to apply to the example of _knowing_that_ a person knows how to skateboard?
It is quite possible that I’m being dense.
Could you please elaborate on the analogy / the point you are making with the analogy?
The brain is just an example of a system we are all running: we understand its baseline mechanics, but any task much more complex than breathing is accomplished through a novel self-organizing structure built by a lot of iteration. Other than very broad-strokes regional distinctions, the brain is not organized by some plan that existed before construction, and is not composed of intelligible, dedicated circuits that we can observe postmortem with perfect information.
The sheer number, variety, and networking of synapses involved in the skill 'skateboarding' is irreducibly, unintelligibly complex for an intelligence on the scale of a conscious human mind to describe, fully comprehend, or even recognize with a great deal of analysis. Even if you decoded all the functional pathways through the network in one example, you would not be able to decode another, because every skateboarder has trained their neural network in a unique manner.
> the brain is not organized by some plan that existed before construction, and is not composed of intelligible, dedicated circuits that we can observe postmortem with perfect information.
Well said. You've reminded me of a beautiful sci-fi short story almost about this exact "mystery"
Base64 encoding is very simple - it's just taking each 6 bits of the input and encoding (replacing) them as one of 64 (2^6) characters A-Za-z0-9+/. If the input is 8-bit ASCII text, then every 3 input characters will be encoded as 4 Base64 characters (3 * 8 = 24 bits = 4 * 6-bit Base64 chunks).
So, this is very similar to an LLM having to deal with tokenized input, but instead of sequences of tokens representing words you've got sequences of Base64 characters representing words.
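To make the 3-bytes-in/4-characters-out mapping concrete, here's a from-scratch sketch of an encoder (a toy, not how any particular library implements it):

    const ALPHABET =
      "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    // Pack each group of 3 bytes (24 bits) and emit 4 six-bit indices,
    // padding with '=' when fewer than 3 bytes remain.
    function base64Encode(bytes: Uint8Array): string {
      let out = "";
      for (let i = 0; i < bytes.length; i += 3) {
        const n =
          (bytes[i] << 16) |
          ((i + 1 < bytes.length ? bytes[i + 1] : 0) << 8) |
          (i + 2 < bytes.length ? bytes[i + 2] : 0);
        out += ALPHABET[(n >> 18) & 63];
        out += ALPHABET[(n >> 12) & 63];
        out += i + 1 < bytes.length ? ALPHABET[(n >> 6) & 63] : "=";
        out += i + 2 < bytes.length ? ALPHABET[n & 63] : "=";
      }
      return out;
    }

    // Matches the worked example further down the thread:
    // "the cat sat on the mat" -> dGhlIGNhdCBzYXQgb24gdGhlIG1hdA==
    console.log(base64Encode(new TextEncoder().encode("the cat sat on the mat")));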
It's not about how simple B64 is or isn't. In fact I chose a simple problem we've already solved algorithmically on purpose. It's that all you've just said, reasonable as it may sound, is entirely speculation.
Maybe "no idea" was a bit much for this example but any idea certainly didn't come from seeing the matrices themselves fly.
That's not entirely true in the case of base64, because of how statistical patterns within natural languages work. For example, you can use frequency analysis to decrypt a monoalphabetic substitution cipher in pretty much any language if you have a frequency table for character n-grams of the language, even with small numbers for n. This is much shallower statistical processing than what's going on within an LLM, so I don't think many were surprised that a transformer stack and attention heads could decode base64. Especially if there were also examples of base64 encoding in the training data (even without parallel corpora for their encodings).
It doesn't explain higher level generalizations like being a transpiler between different programming languages that didn't have any side-by-side examples in the training data. Or giving an answer in the voice of some celebrity. Or being able to find entire rhyming word sequences across languages. These are probably more like the kind of unexplainable generalizations that you were referring to.
I think it may be better to frame it in terms of accuracy vs precision. Many people can explain accurately what an LLM is doing under all those matrix multiplies, both during training and inference. But, precisely why an input leads to the resulting output is not explainable. Being able to do that would involve "seeing" the shape of the hypersurface of the entire language model, which as sibling commenters have mentioned is quite difficult even when aided by probing tools.
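To illustrate how shallow the frequency-analysis step mentioned above is, here's a toy cracker for a Caesar shift (the simplest monoalphabetic substitution), using nothing but the single most frequent letter; the example sentence is contrived so that 'e' actually dominates:

    // Toy frequency analysis: assume the most frequent ciphertext letter
    // is plaintext 'e', derive the shift, and undo it.
    function crackCaesar(ciphertext: string): string {
      const counts = new Array(26).fill(0);
      for (const ch of ciphertext.toLowerCase()) {
        const k = ch.charCodeAt(0) - 97;
        if (k >= 0 && k < 26) counts[k]++;
      }
      const top = counts.indexOf(Math.max(...counts));
      const shift = (top - 4 + 26) % 26; // 4 = alphabet index of 'e'
      return ciphertext.replace(/[a-z]/g, (c) =>
        String.fromCharCode(97 + (c.charCodeAt(0) - 97 - shift + 26) % 26)
      );
    }

    // "everyone here expects the experiment to succeed", shifted by 3:
    console.log(crackCaesar("hyhubrqh khuh hashfwv wkh hashulphqw wr vxffhhg"));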
Huh? I just pointed out what Base64 encoding actually is - not some complex algorithm, but effectively just a tokenization scheme.
This isn't speculation - I've implemented Base64 decode/encode myself, and you can google for the definition if you don't believe I've accurately described it!
The speculation here is not about what b64 text is. It's about how the LLM has learnt to process it.
Edit: Basically, for all anyone knows, it treats b64 as another language entirely, and decoding it is akin within the network to translating French, rather than the very simple swapping you've just described.
LLMs, just like all modern neural nets, are trained via gradient descent, which means following the direction of steepest descent on the error surface to reduce the error, with no more changes to weights once the error gradient is zero.
Complexity builds upon simplicity, and the LLM will begin by noticing the direct (and repeated without variation) predictive relationship between Base64 encoded text and corresponding plain text in the training set. Having learnt this simple way to predict Base64 decoding/encoding, there is simply no mechanism whereby it could change to a more complex "like translating French" way of doing it. Once the training process has discovered that Base64 text decoding can be PERFECTLY predicted by a simple mapping, then the training error will be zero and no more changes (unnecessary complexification) will take place.
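A one-dimensional toy of that "updates stop when the gradient hits zero" dynamic (obviously nothing like a real LLM loss surface):

    // Minimize loss(w) = (w - 3)^2 by gradient descent.
    // gradient = 2 * (w - 3); steps shrink to nothing as w approaches 3.
    let w = 0;
    const lr = 0.1;
    for (let step = 0; step < 100; step++) {
      const grad = 2 * (w - 3);
      w -= lr * grad;
    }
    console.log(w); // ~3.0; once grad is ~0 the weight stops changing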
Isn’t the gradient descent used stochastic gradient descent? I think that could matter a little bit.
Also, when the base model is responding to base64 text, most of the time the next token is also part of the base64 text, right? So presumably the first thing to learn would be, like, predicting how some base64 text continues, which, when the base64 text is an encoding of some ASCII text, seems like it would involve picking up on the patterns for that?
I would think that there would be both those cases, and cases where the plaintext is present before or after.
Yes, most examples in the training set presumably consist of a block of B64 encoded text followed by the corresponding block of plain text.
However, Transformer self-attention is based on key-based lookup rather than adjacency, although embeddings do include positional encoding so it can also use position where useful.
At the end of the day though, this is one of the easiest types of prediction for a transformer/LLM to learn, since (notwithstanding that we're dealing with blocks), we've just got B64 directly followed by the corresponding plain text, so it's a direct 1:1 correspondence of "when you see X, predict Y", as opposed to most other language use where what follows what is far harder to predict.
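For reference, that key-based lookup is scaled dot-product attention. A single-query, single-head toy sketch (no batching or learned projections):

    // out = sum_i softmax(q . k_i / sqrt(d)) * v_i
    function attend(q: number[], keys: number[][], values: number[][]): number[] {
      const d = q.length;
      const scores = keys.map(
        (k) => k.reduce((s, kj, j) => s + kj * q[j], 0) / Math.sqrt(d)
      );
      const m = Math.max(...scores); // subtract max for numerical stability
      const exps = scores.map((s) => Math.exp(s - m));
      const z = exps.reduce((a, b) => a + b, 0);
      const w = exps.map((e) => e / z);
      // Weighted sum of the value vectors
      return values[0].map((_, j) =>
        values.reduce((sum, v, i) => sum + w[i] * v[j], 0)
      );
    }

    // The query "looks up" whichever key it aligns with most:
    console.log(attend([1, 0], [[1, 0], [0, 1]], [[10, 0], [0, 10]]));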
Modern neural networks are by no means guaranteed to converge on the simplest solution, and examples abound in which NNs are discovered to have learned weird, esoteric algorithms when simpler ones exist. The reason why is kind of obvious: the "simplest" solution from the perspective of training is simply whatever works best first.
It's no secret that the order of data has an impact on what the network learns and how quickly; it's just not feasible to police this for giant trillion-token datasets.
If a NN learns a more complex solution that also works perfectly for a less complex subset it meets later on, there is little pressure to switch to the simpler solution, especially when the more complex solution might be more robust to any weird permutations it might meet on the internet. E.g. there is probably a simpler way to translate text that never has typos, and an LLM will never converge on it.
Decoding/encoding b64 is not the first thing it will learn. It will learn to predict it first, as it predicts any other language-carrying sequence. Then it will learn to translate it, most likely long after learning how to translate other languages. All of that will have some impact on the exact process it carries out with b64.
And like I said, we already know for a fact it's not just doing naive substitution, because it can recover corrupted b64 text wholesale in a way our substitution algorithms cannot.
> examples abound in which NNs are discovered to learn weird esoteric algorithms when simpler ones exist
What examples do you have in mind?
Normally it's the opposite, where one hopes for the neural net to learn something complex, and it picks up on a far simpler pattern and uses that instead (e.g. all your enemy tanks are on a desert background, vs the others on a grass background, so it learns to discriminate based on sand vs grass).
You're anthropomorphizing by saying that corrupted b64 text can be "recovered". There is no recovery process, but rather conflicting prediction patterns: the b64 encoding predicting the corresponding plain text, and the plain text predicting its own continuation.
e.g.
"the cat sat on the mat" encodes as dGhlIGNhdCBzYXQgb24gdGhlIG1hdA==, but say we've instead got a corrupted dGhlIGNhdCBzYXQgb24gdGhlIHh4dA== that decodes to "the cat sat on the xxt", so if you ask ChatGPT to decode this, it might start generating as:
dGhlIGNhdCBzYXQgb24gdGhlIHh4dA== decodes to "the cat sat on the" ...
At this point the LLM has two conflicting predictions - the b64 encoding predicting "xxt", and the plain text that it has generated so far predicting "mat". Which of these will prevail is going to depend on the specifics. I haven't tried it, but presumably this "recovery" only works where the encoded text is itself predictable ... it won't happen if you encode a random string of characters.
We don't know what each connection means, or what information is encoded in each weight. We don't know how the network would behave differently if each of the million or trillion weights was changed.
Compare this to a dictionary, where it's obvious what information is on each page and each line.
Skipping some detail: the model applies many high-dimensional functions to the input, and we don't know the reasoning for why these functions solve the problem.
Reducing the dimension of the weights to human-readable values is non-trivial, and multiple neurons interact in unpredictable ways.
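As a concrete illustration of "we can see everything and still not know what it means", here's a toy two-layer forward pass; every weight is inspectable, but no individual number tells you its role (hypothetical tiny shapes, no training loop):

    // A 2-layer toy net: all weights visible, their "meaning" still opaque.
    const relu = (x: number) => Math.max(0, x);
    function forward(x: number[], W1: number[][], W2: number[][]): number[] {
      const h = W1.map((row) => relu(row.reduce((s, w, j) => s + w * x[j], 0)));
      return W2.map((row) => row.reduce((s, w, j) => s + w * h[j], 0));
    }

    // Every multiply is observable; why these particular weights work is not.
    console.log(forward([1, 2], [[0.5, -1], [1, 1]], [[1, -1]]));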
Interpretability research has resulted in many useful results and pretty visualizations[1][2], and there are many efforts to understand Transformers[3][4] but we're far from being able to completely explain the large models currently in use.
The brain serves as a useful analogy, even though LLMs are not brains. Just as we can’t fully understand how we think by merely examining all of our neurons, understanding LLMs requires more than analyzing their individual components, though decoding LLMs is most likely easier, which doesn't mean easy.
We know how they are formed (and how to form them); we don't know why forming them in that particular way solves the problem at hand.
Even this characterization is not strictly valid anymore; there is a great deal of research into what's going on inside the black box. The problem was never that it was a black box (we can look inside at any time), but that it was hard to understand. KANs help some of that be placed into mathematical formulation, and generating mappings of activations over data similarly grants insight.
* Given the training data and the architecture of the network, why does SGD with backprop find the given f, vs. any other of an infinite set?
* Why is there a set of f, each with 0 loss, that work?
* Given the weight space and an f within it, why/when is a task/skill defined as a subset of that space covered by f?
I think a major reason why these are hard to answer is that it's assumed that NNs are operating within an inferential statistical context (i.e., reversing some latent structure in the data). But they're really bad at that. In my view, they are just representation-builders that find proxy representations in a proxy "task" space (where, approximately, proxy = "shadow of some real structure, as captured in an unrelated space").
We know the process to train a model, but when a model makes a prediction we don't know exactly "how" it predicts the way it does.
We can use the economy as an analogy. No single person really understands the whole supply chain. But we know that each person in the supply chain is trying to maximize their own profit, and that ultimately delivers goods and services to a consumer.
There’s a ton of research going into analysing and reverse engineering NNs; this “they’re mysterious black boxes and forever inscrutable” narrative is outdated.
1: It's easier to pronounce
2: Because it's easier to pronounce, all foreigners use it
3: Because it's easier to pronounce and all foreigners use it, it is easier to use (if I introduce myself as being "from the Netherlands" I get blank looks; what I do nowadays is say "I'm from Holland, Amsterdam, Ajax", and 90% of the time people know exactly where I am from)
4: Holland covers about 50% of the population, and about 70% of the economy
5: Holland contains Amsterdam, Schiphol & Rotterdam, basically the places that tourists go to or have heard about
PS: The official name is "Kingdom of the Netherlands", which I guess solidifies my point about Holland being easier :)
I am not from one of the Hollands and I try to avoid using Holland but I will do so sometimes.
It's a bit like using America for the United States, which seems more wrong to me than using Holland for the Netherlands.