You don't have to post the noise. If you are publishing a paper, you already have a solid experiment in place. What is necessary is a way to reproduce that research, and the final dataset used is an important piece of the puzzle. Of course, if a changelog for the data exists, that could be useful too, if only to check whether the authors modified the data to cherry-pick the results they published.
Solutions to the specific problems I mentioned do exist in individual niches, but none of them solves the problem well across all niches, which is what I believe is necessary. What we need is for every dataset behind a scientific paper to be easily accessible and licensed the way code is.
I think that diversity is a strength, honestly.
CERN and high-energy physics have _massive_ datasets. Making them all available online isn't practical.
Other researchers may have one or two files that they want to cite as part of a paper.
Healthcare research may involve confidential data for which specific types of access control are required.
I don't think GitHub would be financially sustainable or scalable if it had to host millions of one-file repos, alongside repos that grow by terabytes per day, alongside repos that hold highly sensitive data.
There are a lot of things that don't fit on GitHub either: sometimes because the code is closed source, sometimes because the data is too big, and sometimes because parts of the data carry legal restrictions on distribution and the user has to obtain them from a different source.
The usual solution is to make a skeleton repo with only partial code or none at all, the real substance being a README that explains what the project is and how to use it. GitHub is a social network as well as a code warehouse, and that comes with benefits: the same system of stars, issues, user groups, permissions, etc. extends across all projects, regardless of whether the code or data is actually hosted on GitHub. Something like this for science could be of huge benefit.
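For concreteness, such a skeleton's README might look something like this sketch (the project name, DOI, archive, and file layout are all invented):

```
# paper-foo-analysis

Analysis code for "Foo: A Study of Bar" (DOI: 10.xxxx/xxxxx).

## Getting the data

The raw data cannot be redistributed here for legal reasons.
Request access from the (hypothetical) XYZ archive, then place
the files under data/raw/.

## Reproducing the results

1. pip install -r requirements.txt
2. python run_analysis.py data/raw/
```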
At the end of the day, we need scientific research to be reproducible. If you draw conclusions from some confidential dataset, how will people check whether those conclusions are true? You have to show your experiment to publishers like Nature or Elsevier to get recognition; I believe the standard should be that anyone can check, if they want. There could be caveats, but in most cases scientific research should be reproducible, and the dataset used is a very important part of that.
You are making quite broad statements, and they don't seem to take into account the diversity of research and scholarly practice. A lot of what you suggest is already happening, but it's far from perfect; the existing solutions all have trade-offs (legal, cost, social, technological).
I think it would make for a stronger argument to acknowledge and identify the existing solutions and practice, and evaluate them against your criteria.
There may be field-specific solutions like Dolthub, but what I believe is needed is something that covers datasets from all fields. GitHub isn't field-specific: there is no separate GitHub for hospitals, physics, AI, etc. It serves any field.
You make valid points about the common understanding of "data-driven" decisions. I want to intentionally broaden the definition of data to include a wider range of inputs influencing our choices.
Let's take the hiring example. While it feels like a "gut feeling," I'd argue it's still rooted in data, just not the clearly quantifiable kind. This "data" includes past experiences, cultural conditioning, and even evolutionary predispositions.
I think that in the current age of LLMs and multi-modal models, you can consider a podcast to be data. For example, you can use a podcast to train a model; in LLM terms, we would call it "data" the model is trained on. So why not call it data when we humans listen to it?
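As a rough sketch of what that looks like in practice, here is how a podcast episode could be folded into a text training corpus (assuming the open-source openai-whisper package; the file names are hypothetical):

```python
# Minimal sketch: turn a podcast episode into plain text that could be
# appended to an LLM training corpus. Assumes `pip install openai-whisper`
# and a local episode.mp3 (both hypothetical here).
import whisper

model = whisper.load_model("base")        # small pretrained speech-to-text model
result = model.transcribe("episode.mp3")  # decode the audio into text

# The transcript is now ordinary text, indistinguishable from any other
# training document once it lands in the corpus.
with open("corpus.txt", "a", encoding="utf-8") as corpus:
    corpus.write(result["text"] + "\n")
```

The point is only that the boundary between "media" and "data" is a preprocessing step.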
The brain is like a neural network, and a network has weights and biases. Every time we feed new data into the brain, through sight, hearing, smell, taste, touch, etc., we modify those weights and biases. Regardless of what the data is or how relevant it is, something changes. So, arguably, anything we consume is data.
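To make the analogy concrete, here is a toy, completely made-up single "neuron" (not a model of a real brain): every input it sees, relevant or not, shifts its weights.

```python
# Toy illustration of the "any input changes the weights" idea:
# a single linear neuron updated by one gradient step on squared error.
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=3)   # weights
b = 0.0                  # bias
lr = 0.01                # learning rate

def update(x, target):
    """One gradient step on 0.5 * (pred - target)^2; returns how far w moved."""
    global w, b
    pred = w @ x + b
    err = pred - target
    w -= lr * err * x    # gradient w.r.t. the weights
    b -= lr * err        # gradient w.r.t. the bias
    return np.abs(lr * err * x).sum()

# Even an arbitrary, "irrelevant" input nudges the weights.
shift = update(x=rng.normal(size=3), target=rng.normal())
print(f"total weight change: {shift:.6f}")  # small, but (almost surely) nonzero
```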
I don't think you can make the argument that the Earth is data-driven. The Earth follows the laws of physics; it's just an object without the ability to process information. Just because it follows a mathematical path doesn't mean it is data-driven.
The core argument isn't that all decisions are based on spreadsheet-style data, but that what we perceive as intuition is complex processing of various inputs accumulated over time.
By expanding our concept of "data," we can gain deeper insights into human decision-making, including seemingly instinctual or emotional choices. I want to complement the current colloquial understanding of data by recognizing the full spectrum of information influencing our decisions.