I've tried out DeepSeek on deepseek.com and it refuses conversations about several topics censored in China (Tiananmen, Xi Jinping as Winnie-the-Pooh).
Has anyone tried if this also happens when self-hosting the weights?
I haven't tried that base model yet, but I have tried the coder model before and experienced similar things: a lot of refusals to write code if the model thought it was unethical or could be used unethically. For example, asking it to write code to download images from an image gallery website would work or not depending on which site it thought it was going to retrieve from.
When I try out the topics you suggest at the huggingface endpoint you link, the answer is either my question translated into Chinese, or no answer when I prompt the model in Chinese:
Interesting - I can't speak to the Huggingface endpoint. I downloaded the 4-bit GGUF model locally and ran it through Oobabooga with instruct-chat template - I expressed my questions in English.
Chinese IPs are not allowed to use ChatGPT.
Chinese credit cards are not allowed for the OpenAI API.
Source: my own experience.
What puzzles me most is the second restriction. My credit card is accepted by AWS, Google, and many other services. It is also accepted by many services which use Stripe to process payments.
Perhaps they are unwilling to operate in a territory where they would be required to disclose every user's chat history to the government, which has potentially severe implications for certain groups of users and also for OpenAI's competitive interests.
> Chinese credit card is not allowed for OpenAI API.
A lot of online services don't accept Chinese credit cards, hosting providers for instance, so I don't think that is specific to OpenAI. The reason usually given for this is excessive chargebacks or (in the case of hosting) TOS violations like sending junk mail (followed by a chargeback when this is blocked). It sounds a little like collective punishment: while I don't doubt that there are a lot of problem users coming from China, with such a large population that doesn't mean the majority of users from the region are a problem. I can see the commercial PoV though: if the majority of chargeback issues and related problems come from a particular region and you get very few genuine customers from there¹, then blocking the area is a net gain despite potentially losing customers.
----
[1] due to preferring local variants (for reasons of just wanting to support local, due to local resources having lower latency, due to your service being blocked by something like the GFW, local services being in their language, any/all the above and more)
It's definitely not a commercial thing but political.
I'm located in Hong Kong and using Hong Kong credit cards has never been a problem with online merchants. I don't think Hong Kong credit cards are particularly bad with chargebacks or whatever. OpenAI has explicitly blocked Hong Kong (and China). Hong Kong and China, together with other "US adversaries" like Iran, N. Korea, etc., are not on OpenAI's supported countries list.
If you have been paying attention, you'll know that US policy makers are worried that Chinese access to AI technology will pose a security risk to the US. This is just one instance of these AI technology restrictions. Ineffectual of course given the many ways to workaround them, but it is what it is.
I don't understand, if ChatGPT is blocked by the firewall, how do you know that ChatGPT is blocking IPs in return? Are there chinese IP ranges that are not affected by censorship that a citizen can use?
Okay but the point is that ChatGPT is blocked by the firewall.
EDIT: I read the comment below about Hong Kong, but I can't reply because I'm typing too fast by HN standards, so I'm writing it here and yolo: "I'm from Italy and I remember when ChatGPT was blocked here after the Garante della Privacy complaint; of course the site wasn't blocked by Italy, but OpenAI complies with local obligations, so maybe that could be the reason for the block. The API was also not blocked in Italy."
EDIT 2: if the website is not actually blocked (the websites that check whether a site is reachable from mainland China lied to me), then I guess they are just complying with local regulations so that the entire website does not get blocked.
it's not blocked by the firewall. i'm in china and i can load openai's website and chatgpt just fine. openai just blocks me from accessing chatgpt or signing up for an account unless i use a VPN and US based phone number for signup
as in, if i open chat.openai.com in my browser without a VPN, from behind the firewall, i get an openai error message that says "Unable to load site" with the openai logo on screen
if the firewall blocks something the page just doesn't load at all and the connection times out
In so far as Hong Kong IPs are "Chinese IPs", we can access OpenAI's website, but their signup and login pages block Hong Kong phone numbers, credit cards and IP addresses.
Curiously, the OpenAI API endpoints work flawlessly with Hong Kong IP addresses as long as you have a working API key.
ChatGPT was not blocked by the GFW for the first few weeks after it was released (if not months, I don't remember), but at that time OpenAI already blocked China.
The geo check only happened once during login at that time, with a very clear message that it's "not available in your region". Once you are logged in with a proxy you can turn off your proxy/VPN/whatever and use ChatGPT just fine.
OpenAI does not allow users from China, including Hong Kong.
Hong Kong generally does not have a Great Firewall, so the only thing preventing Hong Kong users from using ChatGPT is OpenAI's policy. They don't allow registration from Hong Kong phone numbers, from Hong Kong credit cards, etc.
I'd say it's been pretty deliberate.
Reason? Presumably alignment with US government policies of trying to slow down China's development in AI, alongside the chip bans, etc.
Sounds plausible - this is in line with the modern trend to posture by sanctioning innocent people.
Of course, the only demographic these restrictions can affect is casuals. Even I know how to circumvent this; thinking that this could hinder a government agent - who surely has access to all the necessary infrastructure by default - is simply mental.
The now-former board member was a policy hawk. One of their big beliefs is that China is at no risk of keeping up with US companies, due to them not having the data.
I wouldn't be surprised if OpenAI blocking China is a result of them trying to prevent them from generating synthetic training sets.
i know how: you need a verified phone number to open an account, and open ai does not accept chinese phone numbers or known VoIP phone numbers like google voice.
they also block a lot of data center IP addresses, so if you're trying to access chatgpt from a VPN running on blacklisted datacenter IP range (a lot of VPN services or common cloud providers that people use to set up their own private VPNs are blacklisted), then it tells you it can't access the site and "If you are using a VPN, try turning it off."
Probably because of the cost of legal compliance. Various AI providers also blocked Europe until they were ready for GDPR compliance. China has even stricter rules w.r.t. privacy and data control: a lot of data must stay inside China while allowing authorities access. Typically, implementing this properly requires either a local physical presence or a local partner. This is why many apps/services have a completely segregated China offering. AWS's China region is completely sealed off from the rest of AWS, and is offered through a local partner. Similar story with Azure's China region.
I have no idea, but yiyan is short for wenxinyiyan(文心一言), which roughly translates to character-heart-one-(speech/word). Maybe someone who is Chinese could translate it better. So I don't think the name has anything to do with the model.
I do wonder what their backend is. They have the same 3.5/4 version numbering scheme that ChatGPT uses, which could be just marketing (and probably is), but I wonder.
> Also recently released: Yi 34B (with a 100B rumored soon), XVERSE-65B, Aquila2-70B, and Yuan 2.0-102B, interestingly, all coming out of China.
Most AI papers are from Chinese people (either from mainland China or of Chinese ancestry living in other countries). They have a huge pool of brains working on this.
If your GPU has ~16GB of VRAM, you can run a 13B model in "Q4_K_M.gguf" format and it'll be fast. Maybe even with ~12GB.
It's also possible to run on CPU from system RAM, to split the workload across GPU and CPU, or even from a memory-mapped file on disk. Some people have posted benchmarks online [1] and naturally, the faster your RAM and CPU the better.
My personal experience is that running from CPU/system RAM is painfully slow. But that's partly because I only experimented with models that were too big to fit on my GPU, so part of the slowness is due to their large size.
I get 10 tokens/second on a 4-bit 13B model with 8GB VRAM offloading as much as possible to the GPU. At this speed, I cannot read the LLM output as fast as it generates, so I consider it to be sufficient.
Mine is a laptop with an i7-11800H CPU + RTX 3070 Max-Q 8GB VRAM + 64GB RAM (though you can probably get away with 16GB RAM). I bought this system for work and casual gaming, and was happy when I found out that the GPU also enabled me to run LLMs locally at good performance. This laptop cost me ~$1600, which was a bargain considering how much value I get out of it. If you are not on a budget, I highly recommend getting one of the high end laptops that have an RTX 4090 and 16GB VRAM.
With my system, Llama.cpp can run Mistral 7B 8-bit quantized by offloading 32 layers to the GPU (35 total) at about 25-30 tokens/second, or 6-bit quantized by offloading all layers to the GPU at ~ 35 tokens/second.
I've tested a few 13B 4-bit models such as CodeLlama by offloading 37 layers to the GPU, which got me about 10-15 tokens/second.
A CPU would work fine for the 7B model, and if you have 32GB RAM and a CPU with a lot of cores you can run a 13B model as well, though it will be quite slow. If you don't care about speed, it's definitely one of the cheapest ways to run LLMs.
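For reference, a minimal sketch of what GPU offloading looks like with the llama-cpp-python bindings (the model path and layer count are just examples; set n_gpu_layers to whatever fits in your VRAM):

    from llama_cpp import Llama

    # Load a 4-bit quantized 13B model; n_gpu_layers controls how many
    # transformer layers are offloaded to VRAM (the rest run on the CPU).
    llm = Llama(
        model_path="./codellama-13b.Q4_K_M.gguf",  # example path
        n_gpu_layers=37,  # lower this if you run out of VRAM
        n_ctx=4096,
    )

    out = llm("Write a Python function that reverses a string.", max_tokens=256)
    print(out["choices"][0]["text"])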
It's not mentioned in the paper but this month OpenChat 3.5 released the first 7b model that achieves results comparable to ChatGPT in March 2023 [1]. Only 8k context window, but personally I've been very impressed with it so far. On the chatbot arena leaderboard it ranks above Llama-2-70b-chat [2].
In many ways open source LLMs are actually leading the industry, especially in terms of parameter efficiency and shipping useful models that consumers can run on their own hardware.
This month there's also Starling-7B, which is a fine tune of OpenChat with high-quality training data, and ranks even higher than OpenChat.
Strangely, despite the impressive-looking benchmarks of all these open source small models, they all seem a bit dumb to me when I invoke my standard test. I just ask: "who are you?" and they usually say they're ChatGPT. Okay, I can forgive that since they're obviously trained on ChatGPT-generated data. But then I also tried changing the identity with a prompt ("You are Starling, not ChatGPT, and you are created by Berkeley, not OpenAI. Who are you?") and it still gave weird responses that are somehow a mix of both identities. For example, it says in one sentence that it's ChatGPT and then in another sentence in the same response that it's not.
Oh wow, and it has far fewer guardrails than either Llama2 (which is horrible in that regard) or GPT3.5; that's the first time I'm actually really impressed by an open model.
But Mistral 7B has horrible writing. This one, in my tests, wrote actual sentences that made sense, which IME for 7B is extremely impressive. Writing is still far worse than GPT 3.5, but well, 7B.
In my tests, Mistral-based models' writing was excellent, particularly zephyr-7b-beta and starling-7b-alpha derivatives (original Mistral is somewhat too dry). Far better than everything before in OSS (including 70B models), and certainly on par with GPT-3.5.
Apparently trained on lots of refusals too, which speaks to the high competence of whoever was setting up the dataset. It's one string regex to filter them out and get more performance, for fuck's sake.
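For what it's worth, a minimal sketch of that kind of filter, assuming a list of instruction/response pairs and a few illustrative refusal phrases (the exact phrases depend on the dataset):

    import re

    # Hypothetical refusal markers; extend with whatever your dataset actually contains.
    REFUSAL_RE = re.compile(
        r"as an ai (language )?model|i cannot assist with|i'm sorry, but i can't",
        re.IGNORECASE,
    )

    def drop_refusals(samples):
        """Keep only samples whose response doesn't look like a canned refusal."""
        return [s for s in samples if not REFUSAL_RE.search(s["response"])]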
The problem with those numbers is they hit the internal limit before you use all those tokens. There's a limit to how many rules or factors their conditional probability model can keep track of. Once you hit that having a bigger context window doesn't matter.
That's insane. The highest I've personally seen in the open-source space is RWKV being trained on (IIRC) 4k but being able to handle much longer context lengths in practice due to being an RNN (you can simply keep feeding it tokens forever and ever). It doesn't generalize infinitely by any means but it can be stretched for sure, sometimes up to 16k.
It's not a transformer model though, and old context fades away much faster / is harder to recall because all the new context is layered directly on top of it. But it's quite interesting nonetheless.
> It's not a transformer model though, and old context fades away much faster / is harder to recall because all the new context is layered directly on top of it.
That's a well-known limitation. But if you actually know that a "context" comprises multiple sentences (or other elements of syntax) and that any ordering among them is completely arbitrary, the principled approach is to RNN-parse them all in parallel and sum the activations you end up with as vectors, like in a bag-of-words model, essentially enforcing commutativity on the network: that's pretty much how attention-based models work under the hood. The really basic intuition is just that a commutative and associative function can be expressed (hence "learned") as a vector sum, modulo some arbitrary conversion of the inputs and outputs.
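A toy numpy sketch of that idea, just to make the commutativity point concrete (the shapes and the bare-bones RNN are deliberately minimal, not any particular published architecture):

    import numpy as np

    def rnn_encode(tokens, W_h, W_x):
        """Run a bare-bones RNN over one sentence; return the final hidden state."""
        h = np.zeros(W_h.shape[0])
        for x in tokens:  # x: embedding vector for one token
            h = np.tanh(W_h @ h + W_x @ x)
        return h

    def encode_context(sentences, W_h, W_x):
        # Parse each sentence independently (so this can run in parallel),
        # then sum the resulting vectors: the order of sentences no longer matters.
        return sum(rnn_encode(s, W_h, W_x) for s in sentences)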
The numbers are high, but whether 8k is low depends on your use case. Do you want to process whole book chapters, or feed lots of related documents at the same time? If not, and you're just doing a normal question/answer session with some priming prompt, 8k is already a lot.
to be fair, I think the ability of these models to actually use these contexts beyond the standard 8k / 16k tokens is pretty weak. RAG based methods are probably a better option for these ultra long contexts
Most 4K models can use context window extension to get to 8K reasonably, but you're starting to see 16K, 32K, 128K (see YaRN for example) tunes become more common, or even a 200K version of Yi-34B.
YaRN is to blame for making llama.cpp misbehave if you accidentally zero-initialize the llama_context_params structure rather than calling llama_context_default_params :)
I'm finding Mistral good at creative literature, and it is fairly adept at taking instructions, good enough for my purposes, and it runs locally on a consumer CPU. The future of open source local models looks bright.
It depends on what you're doing... Just for reference, here is a small showcase of the capabilities that I've trained on a 13 billion parameter llama2 fine tune (done with qlora).
Amazing work, I've really wanted to get into knowledge graph generation with LLM's for the last year but haven't found the time. Glad to see someone making good progress on the idea!
I was busy adding `chat template` support to vLLM recently, so the model (and any others that implement it properly) will work seamlessly with a clone of the OpenAI chat/completions endpoint.
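For anyone curious, this is roughly what talking to a vLLM server through that OpenAI-compatible endpoint looks like (the URL and model name are placeholders for whatever you're serving):

    from openai import OpenAI

    # Point the standard OpenAI client at a local vLLM server instead of api.openai.com.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

    resp = client.chat.completions.create(
        model="teknium/OpenHermes-2.5-Mistral-7B",  # whatever model the server loaded
        messages=[{"role": "user", "content": "Explain what a chat template does."}],
    )
    print(resp.choices[0].message.content)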
We're nearing a point where we'll just need a prompt router in front of several specialised models (code, chat, math, sql, health, etc)... and we'll have a local Mixture of Experts kind of thing.
1. Send request to router running a generic model.
2. Prompt/question is deconstructed, classified, and proxied to expert(s) xyz.
3. Responses come back and are assembled by generic model.
Is any project working on something similar to this?
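To make it concrete, a rough sketch of steps 1-3, assuming every expert sits behind an OpenAI-compatible endpoint (the expert names, ports, and routing labels are all made up):

    from openai import OpenAI

    # Hypothetical expert endpoints, each serving a specialised model.
    EXPERTS = {
        "code": OpenAI(base_url="http://localhost:8001/v1", api_key="x"),
        "math": OpenAI(base_url="http://localhost:8002/v1", api_key="x"),
        "chat": OpenAI(base_url="http://localhost:8003/v1", api_key="x"),
    }
    router = OpenAI(base_url="http://localhost:8000/v1", api_key="x")  # generic model

    def ask(question: str) -> str:
        # 1. The generic model classifies the request.
        label = router.chat.completions.create(
            model="generic",  # placeholder model name
            messages=[{"role": "user",
                       "content": f"Answer with one word (code/math/chat): {question}"}],
        ).choices[0].message.content.strip().lower()

        # 2. Proxy the question to the matching expert (fall back to chat).
        expert = EXPERTS.get(label, EXPERTS["chat"])
        answer = expert.chat.completions.create(
            model="expert",  # placeholder model name
            messages=[{"role": "user", "content": question}],
        ).choices[0].message.content

        # 3. With a single expert the assembly step is trivial; with several,
        #    the generic model would merge their answers here.
        return answer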
I also think this is the route we are heading, a few 1-7B or 14B param models that are very good at their tasks, stitched together with a model that's very good at delegating. Huggingface has Transformers Agents which "provides a natural language API on top of transformers: we define a set of curated tools and design an agent to interpret natural language and to use these tools"
Some of the tools it already has are:
Document question answering: given a document (such as a PDF) in image format, answer a question on this document (Donut)
Text question answering: given a long text and a question, answer the question in the text (Flan-T5)
Unconditional image captioning: Caption the image! (BLIP)
Image question answering: given an image, answer a question on this image (VILT)
Image segmentation: given an image and a prompt, output the segmentation mask of that prompt (CLIPSeg)
Speech to text: given an audio recording of a person talking, transcribe the speech into text (Whisper)
Text to speech: convert text to speech (SpeechT5)
Zero-shot text classification: given a text and a list of labels, identify to which label the text corresponds the most (BART)
Text summarization: summarize a long text in one or a few sentences (BART)
Translation: translate the text into a given language (NLLB)
Text downloader: to download a text from a web URL
Text to image: generate an image according to a prompt, leveraging stable diffusion
Image transformation: modify an image given an initial image and a prompt, leveraging instruct pix2pix stable diffusion
Text to video: generate a small video according to a prompt, leveraging damo-vilab
It's written in a way that allows the addition of custom tools so you can add use cases or swap models in and out.
I like the analogy to a router and local Mixture of Experts; that's basically how I see things going, as well. (Also, agreed that Huggingface has really gone far in making it possible to build such systems across many models.)
There's also another related sense in which we want routing across models for efficiency reasons in the local setting, even for tasks with the same input modalities:
First, attempt prediction on small(er) models, and if the constrained output is not sufficiently high probability (with highest calibration reliability), route to progressively larger models. If the process is exhausted, kick it to a human for further adjudication/checking.
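In sketch form, that cascade might look something like this (the threshold and the way each model reports a calibrated confidence are assumptions, not a specific implementation):

    def cascade(prompt, models, threshold=0.9):
        """Try progressively larger models; give up to a human if none is confident.

        `models` is ordered smallest to largest; each callable returns
        (answer, confidence), where confidence is some calibrated probability
        assigned to the constrained output.
        """
        for model in models:
            answer, confidence = model(prompt)
            if confidence >= threshold:
                return answer
        return None  # process exhausted: route to a human for adjudication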
The first layer could be a mix of NLP and zero-shot classification to clarify the nature of the request.
Then, using an LLM, deconstruct the request into several specific parts that would be sent to specialized LLMs.
Then stitch it back together at the end, again with an LLM as the summarization machine.
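That first layer is cheap to prototype with an off-the-shelf zero-shot classifier, e.g. something along these lines (the labels are just examples):

    from transformers import pipeline

    # BART fine-tuned on MNLI works as a generic zero-shot classifier.
    classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

    labels = ["code", "math", "health", "sql", "general chat"]
    result = classifier("Write a query that joins orders and customers", labels)
    print(result["labels"][0])  # highest-scoring label, e.g. "sql"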
Problem is running so many LLMs in parallel means you need quite a bunch of resources.
Yeah, it shouldn't be too difficult to build this with python. I wonder why none of the popular routers like https://github.com/BerriAI/litellm have this feature.
> Problem is running so many LLMs in parallel means you need quite a bunch of resources.
Top-of-the-line MacBooks or Minis should be able to run several 7B or even 13B models without major issues. Models are also getting smaller and better. That's why we're close =)
Yeah that would save disk space! In terms of inference, you'd still need to hold multiple models in memory though, and I don't think we're that close to that (yet) on personal devices. You could imagine a system that dynamically unloads and reloads the models as you need them in this process, but that unloading and reloading would be pretty slow probably.
It was rumored a few months ago that this is how GPT-4 works: a controller model routing data to expert models. Perhaps also by running all the experts and comparing probabilities. So far as I know that's just speculation based on a few details leaked on Xitter though.
Current ~70B models like Llama 2 70B are on par with ChatGPT 3.5. The best smaller models can appear on par at first glance, but they hallucinate at a much higher rate and lack knowledge of the world. GPT 4 'gets' things at a deeper level and no open source model is even close.
A year is a good timeframe to evaluate things: the rest of the world seems to lag behind OpenAI by around 12-18 months, at least with LLMs and image generation.
On the other hand open source tech usually has additional features for controlling output that OpenAI never bothers to implement, like llama.cpp’s grammars or ControlNet. So in that sense open source is usually ahead of OpenAI in terms of customizability.
On the other hand, GPT models are converging down. GPT-4 Turbo degraded performance so much that certain 13B models now produce more consistent results in reasoning. I have a marathon test here, for example https://chat.openai.com/share/dfd9b9ae-7214-4dd7-ad20-7ee07a... with purposefully open-ended and somewhat ambiguous requests to see how models perform, and GPT-4 Turbo chat is just not that good: it confuses the persons, didn't pick the right one for abduction, didn't change topic when requested, picked a person from the wrong set when recalling, and didn't change language when asked... It knows a lot when asked zero-shot questions, but in terms of self-consistency and attention it is nowhere near GPT-4.
I don't think using examples derived from ChatGPT is a fair comparison of the underlying models. OpenAI has many optimization tricks on the ChatGPT side that are unrelated to the underlying models being used.
We do know of course that ChatGPT is most likely using 4-Turbo from the decrease in latency and increase in unhelpful answers.
We cannot say that the models are "converging down" though. I don't remember the marketing materials but from the model side we all realize that the Turbo models have some type of quantization/optimization that makes them cheap and fast. 4-Turbo is 3x cheaper than 4, substantially quicker and provides better results than 3.5-Turbo. Amazing progress in my arena.
There were many rumors (and it probably was true) that OpenAI was hemorrhaging cash on GPT4 requests. So it makes tons of sense for them to sprint towards a turbo model at the expense of some ability. GPT4-turbo still is ridiculously powerful anyway.
Your credibility is killed by thinking using an API can guarantee which model you're getting. It's entirely black box. If OpenAI wants to lie to you, they can.
The point is exactly that the model people are experiencing is converging down with every subsequent update, and I even mentioned that it's nowhere near the original GPT-4. idk, possibly read it again slower instead of jumping to credibility and whatnot.
We have to ban accounts that post this way. It's not what this site is for, and it destroys what it is for. Moreover, we've had to warn you about this more than once over the years.
I don't want to ban you, so if you'd please review https://news.ycombinator.com/newsguidelines.html and stick to the rules from now on, we'd appreciate it. That means no swipes and no personal attacks, among other things.
Edit: you've unfortunately been doing this in other recent threads too:
We really are going to have to ban you if this keeps up, so if you'd please make whatever change is needed to not post like this again, that would be good.
BoorishBear, please stop with the aggressive, condescending tone. You seem to have some productive counterarguments but they're getting lost in your disrespectful language.
I don't have a feature to flag your comments and your profile does not point to a way for me to message you privately so this was my only option to call you out
What you call "tone policing" I call "violating numerous HN guidelines":
> Be kind. Don't be snarky. Converse curiously; don't cross-examine. Edit out swipes.
> When disagreeing, please reply to the argument instead of calling names.
> Please don't sneer, including at the rest of the community.
> Please don't post shallow dismissals, especially of other people's work.
I don't think OpenAI is ever going to be ahead in image generation; they were lapped very soon after DALL-E, and every real workflow I've seen uses Midjourney or Stable Diffusion. The reverse (GPT-4 Vision) is well ahead of open source though.
The original is leagues behind anything current, but DALL-E 3 absolutely blows any state-of-the-art generative model out of the water, including Midjourney 5.2 and SDXL, in terms of pure prompt accuracy and coherence.
Midjourney still has the edge in quality, but it's a moot point if it takes you 1000 v-rolls to get to your original vision.
If all you're generating is anime waifus then MJ/NovelAI/Niji will suffice, but prompts featuring relatively complex scenes or actions come out amazing on DALL-E 3.
And of course, it unfortunately goes without saying that OpenAI's DALL-E is going to be the most restrictive in terms of censorship.
I generated these from DALL-E 3 instantly. Try to generate them in any other commercial offering. Go ahead. I'll wait...
An 80s photograph of the Koolaid Man breaking through the Berlin Wall.
Comic illustration set at a festive children's party. The main focus is on the magician who looks uncannily like a well-known fictional wizard. He's trying to say abracadabra but accidentally uses the killing curse.
SDXL has controlnet for other kinds of non-text input (like scribbles or just masks). The results are much easier to control in my opinion (a picture is worth thousands of prompt words).
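For context, driving SDXL with a ControlNet in diffusers looks roughly like this (the model IDs and the precomputed conditioning image are examples; depth, scribble, etc. variants work the same way):

    import torch
    from diffusers import ControlNetModel, StableDiffusionXLControlNetPipeline
    from diffusers.utils import load_image

    controlnet = ControlNetModel.from_pretrained(
        "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16
    )
    pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",
        controlnet=controlnet,
        torch_dtype=torch.float16,
    ).to("cuda")

    canny_image = load_image("my_edge_map.png")  # precomputed edge/scribble image
    image = pipe("a watercolor castle on a cliff", image=canny_image).images[0]
    image.save("out.png")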
For pure prompt coherence though, I think Ideogram is not far behind DALL-E 3.
EDIT: okay, I just tried Ideogram. It's not terrible and seems to do an okay job on text generation, but I'd still say it's a distant second compared to DALL-E 3. However, having the ability to maintain image continuity and make refinements of your initial image based on corrections like "Make the building larger" or "He should have a more prominent forehead" is a game changer (e.g. InstructPix2Pix), and DALL-E 3 is the only one that's got it.
SDXL and even some SD 1.5 checkpoints are great. My current workflow is:
1. Generate initial draft image in DALL-E 3 (iterate as necessary) - it's essentially the ONLY good InstructPix2Pix model.
2. Bring into InvokeAI and inpaint with stuff that might be considered censored in DALL-E 3.
I'd like to see some proof of Ideogram - it looks... very mobile/instagrammy from the landing page. If you have an account, try out my prompts I'd like to see what you're able to produce.
> Midjourney still has the edge in quality, but it's a moot point if it takes you 1000 v-rolls to get to your original vision.
I can corroborate this. I wanted about 6 images for a presentation. I rolled ~300 MidJourney images. Most of them looked great, but none of them did what I wanted. I rolled ~50 DALL-E 3 images.
In the end, I only picked DALL-E 3 images. They were qualitatively not as good as MidJourney; for example, when you zoom in you see distortions, or they're a bad fit for 16:9 format. But only DALL-E 3 was able to draw the things I wanted.
While yours is what you'd want, this arguably looks more like the super cheesy children's TV commercials back in the day and beats the ideogram take.
The Midjourney generations all appear to be referencing Halloween costumes or terrible cosplays, as if there are no trademarked koolaid men in their training set.
yeah, I did some rolls of this image for MJ but that was back in v4 and I wasn't very impressed - doesn't look like it's made much progress. The original commercials, while silly looking, are very visually identifiable as the Koolaid man.
I remember hearing that the first versions of MJ used the LAION image set for training data - I'd be curious to see if it has any training data containing the Koolaid man.
I did a search through my MJ history from the past year and added the results to the imgur link to include my attempts at generating the Koolaid man from v3/v4/v5.2.
Strong disagree on this from me as well. DALL-E 3 is miles ahead of the latest Midjourney/Stable Diffusion in image generation. The only real area it falls short vs the other options right now is in how nannying it can be.
I have found OpenAI to be superior for complex prompts, especially where written messages like "Get better, Mom" are expected in the images. The distant second would be Ideogram.
I am using these tools to send custom personal messages to close friends and family.
LLMs perhaps (I'm not sure either way, everything moves too quickly), but SDXL 1.0 (July 26, 2023) was a lot better than DALL•E 2 (6 April, 2022). I think DALL•E 3 (August 10, 2023) is a bit better than SDXL, but other than text generation their quality seems very close to me.
(That said, perhaps I'm Clever Hans-ing myself by only using SDXL for what it's good at. It's terrible at dragons every time I've tried that…)
The only thing I would argue is that JSON generation and function calling show a noticeable decrease in quality of output in certain uses. I have had a hard time writing tests to measure it, but it's noticeable to my human eyes when I compare various implementations I have written.
No comment from me on the question in the title (because I don't know enough to have an opinion), but since others are discussing various open models I will mention another that I've been enjoying tonight: DeepSeek 67B
I’ve found Mistral OpenOrca is pretty much as good as GPT4-turbo for creative writing/analysis. Actually it tends to output very similar text, which is suspicious, but whatever it saves me a lot of money.
Mistral OpenOrca is very good at task following as well. It's slightly less reliable than GPT 3.5/4, but the difference in quality for my text processing tasks is pretty much a toss-up.
Long term it's almost unavoidable that open source LLMs start catching up. One factor that's worth considering too is cost. The open source community is much more resource constrained and they've really accelerated the pace of development in <30B parameter models.
Google and Meta and all the funded companies also are not even close to GPT 4, so I doubt cost is the biggest factor. Claude is the only model that is decent other than OpenAI's.
My understanding is that the models are pretty comparable, but nobody's reinforcement training set is nearly as good as OpenAI's, so OpenAI is able to fine-tune their model to give more accurate results.
This is an industry where cost will be an issue. It reminds me of Rackspace and others trying to win with OpenStack “because open.” AWS and Azure won. Even Google is third.
The big players will win, and there will be a niche for open tools.
Google only lost because they couldn’t re-adjust their business for their paid products to not be similar to their advertising products.
I can only speak for the European enterprise scene, but AWS came first and in the beginning they went a very "Googley" route of not having very great support and very little patience for local needs. Then Azure came along with their typical Microsoft approach to enterprise, which is where you get excellent support and you get contacts into Microsoft who will actually listen and make changes, well, if the changes align with what Microsoft wants. I know Microsoft isn't necessarily a popular company amongst people who've never interacted with them on an Enterprise level, but they really are an excellent IT business partner because they understand that part of being an Enterprise partner is letting CTOs tell their organisation that they know X is having issues but that Microsoft headquarters is giving them half-hourly updates by phone. Sort of useless from a technical perspective, immensely useful for the CTO when 2000 employees can't log into Outlook. Another good example is how, when Teams rolled out enabled for all users by default, basically every larger organisation in the world went through the official channels and went "nonononono", and a few hours later it was off by default.
Now, when Amazon first entered the European market they were very “Googley” as I said, but once they realized Microsoft business model was losing them customers, they changed. We went from having no contacts to having an assigned AWS person and from not wanting to adopt the GDPR AWS actually became more compliant than even what Azure currently is.
Google meanwhile somehow managed to make the one product they were actually selling (education) worse than it was originally, losing billions of dollars on all the European schools who could no longer use it and be GDPR compliant. The Chinese cloud options obviously had similar data privacy issues to Google and never really became valid options. At least not unless China achieves the same sort of diplomatic relationship with the EU that the US has, which is unlikely.
So that's the long story of why only two of the major cloud providers "won". With the massive price increases, however, more and more companies are leaving, especially Azure, for their own setups. This isn't necessarily a return to having your own iron in the basement; often it's going to smaller cloud providers and then having a third party vendor set up something like Kubernetes.
Right now, Microsoft is winning the AI battle. Not so much because it's better, but because it comes with Office365. Office365 was already a sort of monopoly on Office products, and is now even more so. A good example is again how Teams became dominant, even though it wasn't really the best option for a while and is now only the best option because of how it integrates directly with your SharePoint Online, which is where most enterprise orgs store documents these days. So too is Copilot currently winning the AI battle for organisations who can't really use a lot of the other options because of data privacy issues. So while Copilot isn't as good as GPT, it's still what we are using. But if it ever gets too expensive, Microsoft's position is not as secure as you may think. Especially not if we start seeing more training sets, or if EU and US relations worsen.
I think the most likely outcome, at least here in the EU, is that anti-competition laws eventually take a look at Office365 because of how monopolised it is. Or the EU actually follows through on their "a single vendor is a threat to national security" legislation and forces half of the banking/energy/defense/and-so-on industries to pick something other than Microsoft. Which will be hilariously hard, but if successful (which it probably won't be, because it's hilariously hard) will lead to more open products.
Out of personal experience, open source LLMs have not yet reached the quality of GPT 3.5, despite multiple claims with dubious benchmarks. That said, they are already useful as of today and can even run on your local machine. I regularly use them with my Neovim plugin gen.nvim [1] for simple tasks and they save me a lot of time. I'm excited about the future!
I've been somewhat disappointed with the performance of the open models.
The claims of certain models outperforming GPT-3.5-Turbo and approaching GPT-4 fail to hold up to their benchmark results in real-world scenarios, potentially due to data contamination in assessments, based on my testing.
As noted in the linked survey paper, some models may outperform 3.5-Turbo in specific, narrow areas, depending on the model. Yet, we still lack a general model that definitively exceeds 3.5-Turbo in all respects.
I'm concerned that while we're still striving to reach 3.5-Turbo's performance level, OpenAI may unveil a new next-generation model, further widening the performance gap! Back in the summer, I had higher hopes that we would have surpassed the 3.5 threshold by now.
The performance gap has been surprisingly large. It is especially noticeable in areas requiring consistent structured output or tool use from the LLM. This is where open models particularly falter.
There are tools that you can use to force the model to give structured output, such as llama.cpp's GBNF grammars for example. They're a bit harder to use than just asking GPT-4, but they work pretty well for what I use them for.
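As a small illustration, here's roughly how a GBNF grammar is passed through the llama-cpp-python bindings to constrain output (the grammar below is a trivial yes/no example, not a full JSON grammar, and the model path is a placeholder):

    from llama_cpp import Llama, LlamaGrammar

    # A tiny grammar: the model may only answer "yes" or "no".
    grammar = LlamaGrammar.from_string('root ::= "yes" | "no"')

    llm = Llama(model_path="./mistral-7b-instruct.Q4_K_M.gguf")  # example path
    out = llm("Is the sky blue? Answer yes or no:", grammar=grammar, max_tokens=4)
    print(out["choices"][0]["text"])  # constrained to "yes" or "no"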
Absolutely, pick a complicated problem and keep breaking it down with an existing model (whatever is SOTA) until you have a consistent output for each step of your problem.
And then stitch all the outputs together into a coherent single response for your training pipeline.
After that you can do things like create q&a pairs about the input and output values that will help the model understand the relationships involved.
With that, your training loss should be pretty reasonable for whatever task you are training.
The other thing is, don't try and embed knowledge. Try and train thought patterns when specific knowledge is available in the context window.
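A skeletal version of that loop, assuming an OpenAI-style client as the teacher model (the prompts, model name, and pair format are only illustrative):

    from openai import OpenAI

    teacher = OpenAI()  # whatever SOTA model you use to generate the data

    def build_training_example(problem: str) -> dict:
        # 1. Break the problem into smaller, consistent steps.
        steps = teacher.chat.completions.create(
            model="gpt-4",  # example teacher model
            messages=[{"role": "user",
                       "content": f"Break this problem into numbered sub-steps:\n{problem}"}],
        ).choices[0].message.content

        # 2. Solve the steps and stitch them into one coherent final answer.
        answer = teacher.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user",
                       "content": f"Solve these steps one by one, then give a single coherent final answer:\n{steps}"}],
        ).choices[0].message.content

        # 3. Emit an instruction/response pair for the fine-tuning pipeline;
        #    Q&A pairs about intermediate values can be generated the same way.
        return {"instruction": problem, "response": answer}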
Amethyst Mistral 13B q5 gguf is what I'm using most of the time now. Synthetic datasets are great to finetune with; there is no moat in having inaccessible literature data sets.
I'm offline now because too many of my ideas and domain names have been registered too soon after I conversed with ChatGPT-4.
I'm open to the idea of people reacting to similar stimuli and coming up with the same ideas at the same time, but I didn't like that experience, and I can run these models on my M1 with LM Studio so easily.
I do think some chats get flagged when the model says something seems novel, like Albert Einstein working at the patent office. Not worth making it my whole identity trying to prove that; it was just the catalyst I needed to try 7B and 13B models seriously, and I'm quite pleased.
Is "Amethyst Mistral 13B" a llama fine tune? I searched for it on huggingface and only found the GGUF version, the link to the original model is broken
you think OpenAI employees watch your conversations and register your domain names? Or that OpenAI has a system in place where they try to profit from registering domain names people talk about?
or somebody in between, yes. A random contractor, an intern, someone at the data center, an analytics package nobody put scrutiny on - who knows, but the difference doesn't matter after the experience. It's a vulnerability surface we all know exists and have to trust at all times no matter what assurance we get, as it could change at any time.
Although I find the model to be very agreeable, it will disagree and generally tell me when it finds a concept "novel" if I've identified a friction. I think certain words can be flagged for review to stand out in the sea of conversations it has.
Open source options lacking parts of the ChatGPT approach which make it successful doesn't make the comparisons unfair; it explains why ChatGPT wins right now. There's nothing stopping open source options from using the same MoE architecture, the solutions just don't (to the same effect at least) right now.
The history of open source has been that companies who have customers with massive customization requirements land on the open source side of the equation. Companies who don't view a component as core to their product often land in a similar state.
There is almost certainly at least one major firm that wants a GPT-5-like offering, but doesn't view the model as core to their business (Meta). It's also wholly unclear if large models are necessary, or simply convenient. In a similar vein, it's unclear that data must be labeled by humans; the open source data situation is getting better by the day.
I'd expect that we'll see OpenAI hold an edge for many years, maybe we'll see a number two player as well for the foundation model, but after that everybody else will base off an open source FM and maybe keep the fine tuning/model augmentation proprietary.
I've been using OpenHermes-2.5 [0] and NeuralHermes [1] which are both finetunes of the Mistral7B base model. The only objective test prompting I do is asking the models to generate a django timeclock/timesheets app. In this test they compare favorably to GPT-3.5. Also LMStudio [2] has a better UI than chatgpt and responses are much faster too (40tk/sec on my 2070).
`shiningvaliant-1.2-Q4_K_M` is my go-to. I appreciate that it doesn't top the boards in most metrics vs e.g. GPT-4, but I'm not in some A/B group on the quantization: it's more useful to me in practice more of the time.
I have it rigged up with a prompt about outputting markdown and wired up to `foo | glow -` and I get GPT-4 out when I want something to write JIRA tickets no one is going to read because it's better at that sort of thing.
A problem I have with the open source models is that they are all not remotely good in many languages other than English compared to the OpenAI models. I specifically need Dutch and the outputs are unusable for us.
Yeah, this can be an issue. Today, there are few specialized LLMs of high quality, so you end up having to use a massive all-in-one model like GPT-4 to reach the language you need.
There's movement here though and it will get better. GPT-SW3 is a new model developed by AI Sweden, trained specifically on the Nordic languages only + English.
And beyond this, you have TrustLLM which is a new project that aims to be a large, open, European model trained on the Germanic languages to start with: https://liu.se/en/research/trustllm
* Qwen 72B (and 1.8B) - 32K context, trained on 3T tokens, <100M MAU commercial license, strong benchmark performance: https://twitter.com/huybery/status/1730127387109781932
* DeepSeek LLM 67B - 4K context, 2T tokens, Apache 2.0 license, strong on code (although DeepSeek Coder 33B benches better on code) https://twitter.com/deepseek_ai/status/1729881611234431456
Also recently released: Yi 34B (with a 100B rumored soon), XVERSE-65B, Aquila2-70B, and Yuan 2.0-102B, interestingly, all coming out of China.
Personally, I'm also looking forward to the larger Mistral releasing soon as mistral-7b-v0.1 was already incredibly strong for its size.