This seems like a fairly useful tool, but I'd be a bit cautious - the tradition of poring over a carefully collected and curated data set, using tools whose strengths and weaknesses you understand, shouldn't be lightly tossed aside. That process can help researchers spot unusual anomalies that lead to novel discoveries, while an automated tool might just discard all the outliers.
Incidentally, the far more concerning issue is the use of approaches like this to generate data, which opens the door to a plague of hard-to-detect scientific fraud. In the past, many such high-visibility fraudulent efforts were detected because the fraudsters duplicated data (or reversibly processed old data in some manner) and this was spotted by others in the field, e.g. https://en.wikipedia.org/wiki/Sch%C3%B6n_scandal
Often these fraudulent productions are driven by the desire to be first to publish: everyone thinks they know how a system works, and they're all rushing to get credit (and hence Nobel Prizes, patents, etc.) by generating the data from a 'successful experiment' before anyone else can.
100% agree with you. There are two things driving my work with this demo:
(1) A lot of researchers are bad at writing code, but they can audit it. This is true for sociologists, psychologists, etc., so I'm hoping something like this can help.
(2) Philosophically, I disagree with the claim that LLMs can't produce new knowledge. I think there's merit to it if we're talking about whether the LLM's neural network itself synthesizes new knowledge via its weights... However, why can't we have an LLM try to merge multiple data sets, analyze them, and report back to a human?
To your point + concerns, I think a human still needs to be very careful and actually revisit the analysis for any promising findings, but at least some of the grunt work can be taken care of!
> A lot of researchers are bad at writing code, but they can audit it.
Can they really?
It seems to me that users of this would be those that can't write the code they need. How would they be in a position to audit what they get?
I'll use myself as an example. I love Pandas + Scikit Learn, but am by no means an expert. Every time I want to build a logistic regression, I have to go back to the docs to review the API.
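For concreteness, here's roughly the kind of boilerplate I end up re-checking every time (a minimal scikit-learn sketch with made-up data, not anything from the demo):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in data: two features, one binary outcome.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

    # The part I always have to re-look-up: which class, which defaults, which method names.
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = LogisticRegression().fit(X_train, y_train)
    print(model.coef_, model.score(X_test, y_test))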
When I was doing my Master's degree in "Social Science of the Internet" at the Oxford Internet Institute (a sociology + data science program, with many students coming from non-STEM backgrounds), everyone was comfortable debating p-values, standard errors in regressions, etc., but many students were extremely intimidated by reading Python docs and/or using the REPL.
It's a really cool area - putting AI in a feedback loop (LangChain-like) with its own tools - which I think is where the magic happens, and where we'll see much more happening in the future. This should really supercharge engineers doing stuff in areas where they're not super-comfortable, but comfortable enough to verify the AI isn't doing anything stupid.
I made something vaguely similar for your local terminal[0] and other locally-available tools.
The idea is to give you a chat with an assistant that can use these local tools. Here it's Python for data analysis; in my case it's more "give it access to your terminal, so it can answer questions / do tasks on your local machine", which is something web-based options can't do right now.
E.g. ask it about your system details (processes, Wi-Fi) or to do things (configure something). Have it automatically run the relevant commands, analyze the output, and respond either in natural language or, say, by plotting a chart.
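A stripped-down sketch of the loop (the ask_llm call is just a placeholder for whichever chat-completion API you plug in):

    import subprocess

    def ask_llm(prompt: str) -> str:
        # Placeholder: swap in your chat-completion call of choice.
        return "(model reply would go here)"

    # Run a local command, then let the model interpret the raw output.
    result = subprocess.run(["ps", "aux"], capture_output=True, text=True)
    answer = ask_llm(
        "Here is the output of `ps aux`. Which processes are using the most memory?\n\n"
        + result.stdout
    )
    print(answer)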
AutoGPT[1] is another very interesting project in this area.
Asking the LLM if it "understands" and only proceeding if it says yes feels very weird to me. Do we really expect the LLM to be able to introspect in that way and give a meaningful answer?
Nope! I'm not trying to suggest the LLM legitimately understands my query from a conceptual perspective via that question.
What I am doing in that prompt is ensuring the LLM can follow the instructions. I specifically ask it to write "yes" if it can. If it can't do that part, then I don't want it to even generate code or try to analyze my data.
That's why I treat it as an assertion failure if it can't follow that instruction, and exit the app.
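Roughly, the check looks like this (a simplified sketch of the idea, not the exact code in the repo):

    def confirm_instructions(chat_fn, instructions: str) -> None:
        """Ask the model to acknowledge the instructions with a literal 'yes' before doing anything else."""
        prompt = instructions + "\n\nIf you can follow these instructions, reply with 'yes' and nothing else."
        reply = chat_fn(prompt)  # chat_fn is a placeholder for your LLM call
        if reply.strip().lower() != "yes":
            # Treat failure to follow even this instruction as an assertion failure and bail out.
            raise SystemExit("LLM did not acknowledge the instructions; not generating any code.")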
Thanks for reading the code + for the very thoughtful question.
The broader package I’m working on (PhaseLLM) is specifically focused on devtools for observability and robustness of LLM-powered products. I agree with you that there are lots of issues with LLMs and taking them to production. I’m hoping products like this, plus making them robust, will help improve the research as well as the UX.
I'm not sure about this approach. From what I have seen, most researchers have no idea how to get their data into a format which can be efficiently analysed.
Once you have that, it's trivial to do any kind of statistical analysis. In R, a regression is simply lm(y ~ x1 + x2 + ... + xn).
You can always look up how an API works; it's the struggle to think about data in terms of structures that hinders effective analysis in most cases.
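To make the contrast concrete: once the data is tidy (one row per observation, one column per variable), the Python side is just as short as that R one-liner. A rough sketch with a tiny stand-in data set:

    import pandas as pd
    import statsmodels.formula.api as smf

    # Tiny stand-in for a tidy data set: one row per observation, one column per variable.
    df = pd.DataFrame({
        "y":  [3.1, 4.0, 5.2, 6.1, 7.0, 8.2],
        "x1": [1, 2, 3, 4, 5, 6],
        "x2": [0, 1, 0, 1, 0, 1],
    })

    # Formula interface mirrors R's lm(y ~ x1 + x2).
    model = smf.ols("y ~ x1 + x2", data=df).fit()
    print(model.summary())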
Totally appreciate the feedback, and I agree with you that a well-structured data set can be trivially analyzed. Heck, at that point you can use drag-and-drop stats packages too.
The data set I used for the demo stores income categories as strings and has a mix of categorical variables that the LLM had to transform (a rough illustration of that kind of step is sketched below), which is incredibly promising.
The insights that Claude generated also imply that it can do follow-up analysis.
This is less of a “hey, write my regression code for me” and more of a “suggest the analysis, do it, find insights, and run follow-up analyses”. That’s way more powerful and interesting.
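For example, the income transformation looks something like this (a hand-written sketch with hypothetical bracket labels, not Claude's actual output):

    import pandas as pd

    # Hypothetical income brackets stored as strings, similar in spirit to the demo data.
    df = pd.DataFrame({"income": ["<25k", "25k-50k", "50k-100k", ">100k", "25k-50k"]})

    # Ordered brackets -> integer codes, so they can feed into a regression.
    order = ["<25k", "25k-50k", "50k-100k", ">100k"]
    df["income_code"] = pd.Categorical(df["income"], categories=order, ordered=True).codes

    # Unordered categoricals can be one-hot encoded instead.
    dummies = pd.get_dummies(df["income"], prefix="income")
    print(df.join(dummies))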
Good points, and I really appreciate your work. You are addressing a real problem.
I'm simply skeptical that someone who is able to make good use of Claude (or any LLM) for data analysis actually needs it. Let's hope I'm too pessimistic here.
Anecdotally, my wife - a researcher in management accounting who does a lot of analysis of corporate data - was very excited about this tool, because it allows her to explore the dataset in almost natural language and gives her a starter Python code base to tinker with.
I have seen her use Python. She uses it like a research notebook: sequential, pipeline-like analysis steps. Any little change to a step, and she will re-run the whole thing :-)
I did something similar to this but got stuck on the fact that the generated code would sometimes work and sometimes not, for identical prompts. I also found that, as an expert in the topic, it was easy to write a prompt that would generally build a reasonable data pipeline, but I can't imagine doing the same if I just had some data and not the expertise. How do you account for these issues?
Fantastic questions! Re: working/not working at times -- this is still an issue. It's why I'm building PhaseLLM more broadly (https://github.com/wgryc/phasellm) -- you need a robust pipeline that can also "reset" parts of itself if the LLM makes errors.
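To give a flavour of what I mean by "reset" (just a generic retry sketch, not PhaseLLM's actual API):

    def run_step_with_retries(generate_code, execute, max_attempts=3):
        # generate_code and execute are placeholders: one asks the LLM for code
        # (optionally passing back the last error), the other runs that code.
        last_error = None
        for _ in range(max_attempts):
            code = generate_code(error=last_error)
            try:
                return execute(code)
            except Exception as exc:
                last_error = str(exc)  # feed the failure back into the next attempt
        raise RuntimeError(f"Step still failing after {max_attempts} attempts: {last_error}")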
I think there’s a lot of value here in empowering business users and more operational folks to use data without needing familiarity with a tool or language meant for data science.