
Solutions for the specific problems I mentioned do exist for niches. But none of them solves it well for all niches, which is what I believe is necessary. What we need is for all datasets from scientific papers to be easily accessible and licensed like code.



I think the diversification is a strength, honestly.

CERN and high-energy physics have _massive_ datasets. Making them all available online isn't practical.

Other researchers may have one or two files that they want to cite as part of a paper.

Healthcare research may have confidential data for which there are specific types of access control required.

I don't think GitHub would be financially sustainable or scalable if it had to host millions of one-file repos, alongside repos that grow by terabytes per day, alongside those that hold highly sensitive data.


There are a lot of things that don't fit on GitHub either. Sometimes because it's closed source, sometimes because the data is too big, sometimes because parts of the data have legal restrictions on distribution and require the user to get it themselves from a different source.

The usual solution is to make a skeleton repo with only partial or no code, the real substance being a README that explains what the project is and instructions on how to use it. GitHub is a social network as well as a code warehouse in a way, and this comes with benefits. The same system for stars, issues, user groups, permissions etc. extends across all projects regardless of whether the code/data is actually hosted on GitHub. Something like this for science could be of huge benefit.


At the end of the day, we need scientific research to be reproducible. If you draw conclusions from a confidential dataset, how will people check whether what you are saying is true? You have to show your experiment to a journal like Nature or a publisher like Elsevier in order to get recognition. I believe the standard should be that anyone can check, if they want. There could be some caveats, but I believe that, in most cases, scientific research should be reproducible, and the dataset used is very important for reproducibility.


You are making quite broad statements, and they don't seem to take into account the diversity of research and scholarly practice. A lot of what you suggest is happening already, but it's far from perfect. The existing solutions all have trade-offs (legal, cost, social, technological).

I think it would make for a stronger argument to acknowledge and identify the existing solutions and practice, and evaluate them against your criteria.



