
Hello, I would like to take this opportunity to ask for help here with using A.I. on my own codebase.

Context: I missed [almost] the entire A.I. wave, but I knew that one day I would have to learn something about it and/or use it. That day has come. I'm on a team that is migrating to another engine, let's say "engine A → engine B". We are looking from the perspective of A: we map the entries for B (inbound), and after the request to B returns, we map back to A's model (outbound). This is a chore, and much of the work is repetitive, but it comes with edge cases that we need to look out for, and unfortunately there isn't a solid foundation of patterns apart from the Domain-driven design (DDD) approach. It seemed like a good use case for an A.I.

Attempts: I began by asking ChatGPT and Bard questions similar to: "how to train LLM on own codebase" and "how to get started with prompt engineering using own codebase".

I concluded that fine-tuning is expensive and, for large models, unrealistic for my RTX 3060 with 6 GB of VRAM, no surprise there; so I searched here on Hacker News for keywords like "llama", "fine-tuning", "local machine", etc., and found out about Ollama and DeepSeek.

I tried both Ollama and DeepSeek; the former was slow, but not as slow as the latter, which was dead slow with a 13B model. I then tried a 6/7B model (I think it was codellama) and got reasonable results and speed. After feeding it some data, I was about to try training on the codebase when a friend suggested that I use Retrieval-Augmented Generation (RAG) instead, with a LangChain + Ollama setup; I have yet to try it.
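
From what I've read, the rough shape of that RAG setup would be something like the sketch below. I haven't run this yet, and I've left LangChain out just to show the moving parts: naive keyword scoring stands in for a real embedding model + vector store, and the glob pattern, model tag and question are only placeholders.

    import glob
    import ollama  # pip install ollama; assumes a local Ollama server with the model pulled

    def load_chunks(pattern="src/**/*.py", size=40):
        """Split every source file into ~40-line chunks."""
        chunks = []
        for path in glob.glob(pattern, recursive=True):
            lines = open(path).read().splitlines()
            for i in range(0, len(lines), size):
                chunks.append((path, "\n".join(lines[i:i + size])))
        return chunks

    def retrieve(question, chunks, k=3):
        """Naive keyword overlap; a real setup would use embeddings and a vector store."""
        words = set(question.lower().split())
        return sorted(chunks, key=lambda c: -len(words & set(c[1].lower().split())))[:k]

    question = "How do we map engine A's order entries to engine B's inbound model?"
    context = "\n\n".join(f"# {path}\n{text}" for path, text in retrieve(question, load_chunks()))

    response = ollama.chat(
        model="codellama:7b",
        messages=[{"role": "user", "content": f"{context}\n\nQuestion: {question}"}],
    )
    print(response["message"]["content"])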

Any thoughts, suggestions or experiences to share?

I'd appreciate it.




> much of the work is repetitive, but it comes with edge cases

for the repetitive stuff, just use copilot embedded in whatever editor you use.

the edge cases are tricky; to actually avoid these, the model would need an understanding of both the use case (which is easy to describe to the model) and the code base itself (which is difficult, since a description/docstring is not enough to capture the complex behaviors that can arise from interactions between parts of your codebase).

idk how you would train/finetune a model to somehow have this understanding of your code base. I doubt just doing next-token prediction would help; you'd likely have to create chat data discussing the intricacies of your code base and do DPO/RLHF to bake it into your model.

look into techniques like QLoRA that reduce the memory needed during tuning. look into platforms like vast.ai to rent GPUs for cheap.

RAG/agents could be useful, but probably not. you could store info about each function in your codebase, such as its signature, the functions it calls, its docstring, and the known edge cases associated with it (see the sketch below). if you don't have docstrings, using an LLM to generate them is feasible.
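
a minimal sketch of gathering that per-function info with Python's ast module; the edge-case notes still have to be filled in by hand (or by prompting an LLM), and the file path is just an example:

    import ast, json

    def extract_function_info(path):
        """Collect name, args, docstring and called names for each function in a file."""
        tree = ast.parse(open(path).read())
        info = []
        for node in ast.walk(tree):
            if isinstance(node, ast.FunctionDef):
                calls = sorted({
                    n.func.id for n in ast.walk(node)
                    if isinstance(n, ast.Call) and isinstance(n.func, ast.Name)
                })
                info.append({
                    "name": node.name,
                    "args": [a.arg for a in node.args.args],
                    "docstring": ast.get_docstring(node) or "",
                    "calls": calls,
                    "edge_cases": [],  # fill in manually or via an LLM
                })
        return info

    print(json.dumps(extract_function_info("mappers/inbound.py"), indent=2))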


You could try using aider [0], which is my open source AI coding tool. You need an OpenAI API key, and aider supports any of the GPT 3.5 or 4 models. Aider has features that make it work well with existing code bases, like git integration and a "repository map". The repo map is used to send GPT-4 a distilled map of your code base, focused specifically on the coding task at hand [1]. This provides useful context so that GPT can understand the relevant parts of the larger code base when making a specific change.

[0] https://github.com/paul-gauthier/aider

[1] https://aider.chat/docs/repomap.html


I gave aider an afternoon this week (which is not a lot of time to learn anything, of course). It did too many wild things to the project and repository I was using it with (Rails 7 base) to comfortably explore this further – for now.

Paul, if you are up for it, it would be tremendously helpful to have a video (or series) that shows what aider can realistically do given a boring, medium-sized CRUD code base. The logs in the examples are too narrow and also don't build intuition about what to do when things go wrong.


Thanks for trying aider, and sorry to hear you had trouble getting the hang of it. It might be worth looking through some of the tips on the aider github page:

https://github.com/paul-gauthier/aider#tips

In particular, this is one of the most important tips: Large changes are best performed as a sequence of thoughtful bite sized steps, where you plan out the approach and overall design. Walk GPT through changes like you might with a junior dev. Ask for a refactor to prepare, then ask for the actual change. Spend the time to ask for code quality/structure improvements.

Not sure if this was a factor in your attempts? But it's best not to ask for a big sweeping change all at once. It's hard to unambiguously and completely specify what you want, and it's also harder for GPT to succeed at bigger changes in one bite.


I second this. Great tool. I always plug your tool in these threads too when relevant. The tree-sitter repo map was a great change. Thank you.


Unclear exactly what you are expecting to do here, but in any case you shouldn't need to train on your own codebase.

The idea is you put your code into the best possible model (GPT4) and tell it what you want and it generates code.


My goal is to (1) try running an A.I. locally and see if it works, out of curiosity, and (2) delve into A.I. concepts. I do not intend to use it as the definitive tool to code for me, and maybe I shouldn't.

Realistically, since we are in an Azure ecosystem, I would use Codex to try out a solution.


I think you are going the wrong way - I would start with the best possible AI (which will probably be GPT4) and see if it can do it, and then walk backwards to local AI deployments (which are currently significantly weaker).


> My goal is to (1) try running an A.I. locally and see if it works, out of curiosity, and (2) delve into A.I. concepts.

No this isn't what I mean.

Unrelated to where the model runs, it is unclear how you are specifying what you want the model to do. Usually you input the code and some natural language instructions into the context of the model and it will follow the instructions to generate new code.

But you say you are fine tuning, and it's unclear why this is.

There could be good reasons to do this: Maybe you have an in-house language it doesn't know, or maybe you are fine tuning on old-data -> old-transform-code -> new data + errors or something.

But usually you don't need to do fine tuning for this task.
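
For reference, the no-fine-tuning workflow is basically just a prompt like the sketch below (using the OpenAI Python SDK, v1+); the model name, file path and wording of the instructions are placeholders, not a recipe:

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    source_code = open("mappers/inbound_mapper.py").read()  # placeholder path

    prompt = (
        "Here is one of our existing A->B inbound mappers:\n\n"
        f"{source_code}\n\n"
        "Write the matching outbound mapper (B's response back to A's model), "
        "following the same conventions. Flag any field you are unsure about."
    )

    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    print(response.choices[0].message.content)

There is no fine-tuning anywhere in that loop; the code and the instructions travel in the prompt.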


Ever since I started doing this exercise, I've been excited about the future, with LLMs helping us.

Now I definitely share Linus' sentiment [1] on this topic.

It would be incredible to feed an A.I. some code and ask it to track down a bug.

[1]: https://blog.mathieuacher.com/LinusTorvaldsLLM/


Is writing the code the hard part, or is ensuring what you've written is correct the hard part? I'd guess the latter, and AI will not ensure the code is correct.

Can you record input and output at some layers of your system and then use that data to test the ported code? Make sure the inputs produce the same outputs.
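
A rough sketch of that record/replay idea, assuming the layer's inputs and outputs can be serialized to JSON (the function and file names are made up):

    import json

    # Recording side: wrap the old engine-A mapper and log real traffic.
    def record(old_mapper, payload, log_path="recorded_cases.jsonl"):
        result = old_mapper(payload)
        with open(log_path, "a") as f:
            f.write(json.dumps({"input": payload, "expected": result}) + "\n")
        return result

    # Replay side: run the ported engine-B mapper against every recorded case.
    def replay(new_mapper, log_path="recorded_cases.jsonl"):
        failures = []
        with open(log_path) as f:
            for line in f:
                case = json.loads(line)
                actual = new_mapper(case["input"])
                if actual != case["expected"]:
                    failures.append((case["input"], case["expected"], actual))
        return failures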


Yes, and I also imagine some kind of AI thing would be useful for reading logs and writing other tests that document what the system does in a nice BDD style.

But you still have to read the tests and decide if that's what you want the code to do, and make sure the descriptions aren't gobbledygook.


If the code to write is repetitive, then just write some code that does it; no AI needed.

Presumably what matters in this project is correctness, not how many unnecessary cycles you can burn.
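
For instance, if most of the mapping is field renames plus the odd conversion, a declarative table plus one generic function covers the repetitive part; the field names and converters below are invented for illustration:

    # Map A's field names to (B's field name, converter) pairs.
    INBOUND_FIELD_MAP = {
        "customer_id": ("clientId", str),
        "amount_cents": ("amount", lambda cents: cents / 100),
        "created_at": ("createdAt", lambda ts: ts.isoformat()),
    }

    def map_inbound(a_entry: dict) -> dict:
        """Translate an engine-A entry into engine-B's shape."""
        b_entry = {}
        for a_field, (b_field, convert) in INBOUND_FIELD_MAP.items():
            if a_field in a_entry:
                b_entry[b_field] = convert(a_entry[a_field])
        return b_entry

The edge cases then live as explicit, reviewable exceptions to the table instead of being buried in hand-written mapping code.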


Maybe this is not relevant to you, but would it make any sense to first try Copilot with IntelliJ or Visual Studio?


Copilot has such a narrow input space that it is not going to help in this case. Here, just saved you $$


What do you mean by this? I’ve been getting great mileage out of copilot, especially since learning about the @workspace keyword.

Could you quantify your criticism a bit more? (genuinely asking)


You can use @workspace to tell it to ingest more of your workspace as required.


> We are looking from the perspective of A: we map the entries for B (inbound), and after the request to B returns, we map back to A's model (outbound). This is a chore, and much of the work is repetitive, but it comes with edge cases that we need to look out for, and unfortunately there isn't a solid foundation of patterns apart from the Domain-driven design (DDD) approach.

This sounds like a job for protobufs or some kind of serialization solution. And you already know there are dragons here, so letting an LLM try to solve this is just going to mean more rework/validation for you.

If you don't understand the problem space, hire a consultant. LLMs are not consultants (yet). Either way, I'd quit wasting time on trying to feed your codebase into an LLM and just do the work.


Thanks. It would be interesting, though, to see how things play out. Fortunately, it's not a requirement, just a "side quest".


> much of the work is repetitive, but it comes with edge cases that we need to look out for

Then don't use AI for it.

Bluntly.

This is a poor use-case; it doesn't matter what model you use, you'll get a disappointing result.

These are the domains where using AI coding currently shines:

1) You're approaching a new well established domain (eg. building an android app in kotlin), and you already know how to build things / apps, but not specifically that exact domain.

Example: How do I do X but for an android app in kotlin?

2) You're building out a generic scaffold for a project and need some tedious (but generic) work done.

Example: https://github.com/smol-ai/developer

3) You have a standard, but specific question regarding your code, and although related Q/A answers exist, nothing seems to specifically target the issue you're having.

Example: My nginx configuration is giving me [SPECIFIC ERROR] for [CONFIG FILE]. What's wrong and how can I fix it?

The domains where it does not work are:

1) You have some generic code with domain/company/whatever specific edge cases.

The edge cases, broadly speaking, no matter how well documented, will not be handled well by the model.

Edge cases are exactly that: edge cases. The common body of "how to do x" material does not cover them, so they will not be handled and the results will require you to review and complete them manually.

2) You have some specific piece of code you want to refactor 'to solve xxx', but the code is not covered well by tests.

LLMs struggle to refactor existing code, and the difficulty is proportional to the code length. There are technical reasons for this (mainly the randomness of token sampling), but tl;dr: it's basically a crapshoot.

Might work. Might not. If you have no tests who knows? You have to manually verify both the new functionality and the old functionality, but maybe it helps a bit, at scale, for trivial problems.

3) You're doing something obscure or using a new library / new version of the library.

The LLM will have no context for this, and will generate rubbish / old deprecated content.

Obscure requirements have an unfortunate tendency to produce output that mimics the few training examples that exist, sometimes as verbatim copies, depending on the model you use.

...

So. Concrete advice:

1) sigh~

> a friend of mine came and suggested that I use Retrieval-Augmented Generation (RAG), I have yet to try it, with a setup Langchain + Ollama.

Ignore this advice. RAG and langchain are not the solutions you are looking for.

2) Use a normal coding assistant like copilot.

This is the most effective way to use AI right now.

There are some frameworks that let you use open source models if you don't want to use OpenAI.

3) Do not attempt to bulk generate code.

AI coding isn't at that level. Right now, the tooling is primitive, and large scale coherent code generation is... not impossible, but it is difficult (see below).

You will be more effective using an existing proven path that uses 'copilot' style helpers.

However...

...if you do want to pursue code generation, here's a broad blueprint to follow:

- decompose your task into steps

- decompose your steps into functions

- generate or write tests and function definitions

- generate an api specification (eg. .d.ts file) for your function definitions

- for each function definition, generate the code for the function, passing the API specification in as context. eg. "Given functions x, y, z with the specs... ; generate an implementation of q that does ...".

- repeatedly generate multiple outputs for the above until you get one that passes the tests you wrote (see the sketch below).

This approach broadly scales to reasonably complex problems, so long as you partition your problem into module sized chunks.

I personally like to put something like "you're building a library/package to do xxx" or "as a one file header" as a top level in the prompt, as it seems to link into the 'this should be isolated and a package' style of output.
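
A rough sketch of that generate-until-green loop, with the actual model call left as a stub (generate_implementation would wrap whatever API or local model you use; everything here is illustrative and assumes pytest):

    import subprocess
    import tempfile
    from pathlib import Path

    def generate_implementation(spec: str, stub: str, attempt: int) -> str:
        """Stub: call your model of choice (GPT-4, a local 7B, ...) with spec + stub here."""
        raise NotImplementedError

    def passes_tests(candidate_code: str, test_code: str) -> bool:
        """Drop the candidate and its tests into a temp dir and run pytest on it."""
        with tempfile.TemporaryDirectory() as tmp:
            Path(tmp, "impl.py").write_text(candidate_code)
            Path(tmp, "test_impl.py").write_text(test_code)  # tests import from impl
            result = subprocess.run(["pytest", "-q", tmp], capture_output=True)
            return result.returncode == 0

    def generate_until_green(spec, stub, tests, max_attempts=5):
        for attempt in range(max_attempts):
            candidate = generate_implementation(spec, stub, attempt)
            if passes_tests(candidate, tests):
                return candidate
        return None  # give up and write it by hand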

However, I will caveat this with two points:

1) You generate a lot of code this way, and that's expensive if you use a charge-per-completion API.

2) The results are not always coherent, and models tend to (depending on the model, eg. 7B mistral) inline implementations of 'trivial' functions instead of calling them (eg. if you define Vector::add, the model will 50/50 just go a = new Vector(a.x + b.x, a.y + b.y)).

I've found that the current models other than GPT4 are prone to incoherence as the problem size scales.

7B models, specifically, perform significantly worse than larger models.


Very well researched!

I'd add the MR review use case.

I had limited success feeding an LLM (a Dolphin finetune of Mixtral) the content of a merge request coming from my team. It was a few thousand lines of added integration test code, and I just couldn't be bothered/had little time to really delve into it.

I slapped the diff in and tried about 10 prompt strategies to get anything meaningful. My first impression was that it had clearly been finetuned on responses that were too short: it kept putting in "etc.", "and other input parameters", "and other relevant information". At one point I was ready to give up; it was clearly hallucinating.

Or that's what I thought: it turned out that a new edge case had been added to existing functionality without me ever noticing (despite being in the same meetings).

I think it actually saved me a lot of hours of pestering other team members.
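
If anyone wants to try the same thing locally, the rough shape is something like the sketch below (using the ollama Python client; the branch name, model tag and prompt wording are illustrative):

    import subprocess
    import ollama  # pip install ollama; assumes a local Ollama server with the model pulled

    # Grab the MR diff (here: current branch against main).
    diff = subprocess.run(
        ["git", "diff", "main...HEAD"], capture_output=True, text=True
    ).stdout

    prompt = (
        "You are reviewing a merge request. Summarise what it changes, "
        "list any new edge cases or behaviours it introduces, and point out "
        "anything that looks risky.\n\n" + diff
    )

    response = ollama.chat(
        model="dolphin-mixtral",
        messages=[{"role": "user", "content": prompt}],
    )
    print(response["message"]["content"])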



