
I agree for one-offs and for simple mappings. If I had to do this as part of a personal workflow used only by me, then I would just use `pandas` or some equivalent for the entire thing and have it live in a Jupyter notebook.

However, if the mapping is even somewhat complicated, or this pipeline has to be shared and productized in some way, then it would be better to load the data using some `pandas`-like tool, store it in a T-SQL-flavored database or data lake, and then export it as a .csv file using a native tool or another `pandas` equivalent.
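As a rough sketch, that shape of pipeline might look like the following in Python. The connection string, driver, column mapping, and file/table names are all hypothetical placeholders, and it assumes a SQL Server target reachable through SQLAlchemy:

    # Hedged sketch: load with pandas, land the data in a T-SQL-flavored
    # database, then export a .csv from the stored table. Connection string,
    # file names, and table name are placeholders, not a real setup.
    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine(
        "mssql+pyodbc://user:password@myserver/mydb?driver=ODBC+Driver+17+for+SQL+Server"
    )

    # 1. Load the raw data with a pandas-like tool.
    raw = pd.read_csv("raw_input.csv")

    # 2. Apply the mapping, then store the result in the database / data lake.
    mapped = raw.rename(columns={"old_name": "new_name"})
    mapped.to_sql("mapped_data", engine, if_exists="replace", index=False)

    # 3. Export from the stored table so the .csv can be regenerated on demand.
    pd.read_sql("SELECT * FROM mapped_data", engine).to_csv("output.csv", index=False)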

Having a pipeline live solely in a notebook that is passed around leaves too much risk of dependency hell, and relying on myself to create the .csv as needed is too brittle. Either have the pipeline live in its own container that can be started and run as needed by anyone, or dump the relevant data into the data lake and perform all the needed transformations there, where the workflow can be stored and reused.




> Either have the pipeline live in its own container that can be started and run as needed by anyone

This is what I do. Each pipeline starts in a fresh conda environment, or sometimes an entirely fresh container (if running in GitLab CI or GitHub Actions, for example), and does a pip install of pinned packages. So every run is a fresh install and there's no dependency hell.
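A minimal illustration of that pattern using only the standard library (venv standing in for conda here; the requirements file and pipeline script names are assumptions, not part of the setup described above):

    # Hedged sketch: rebuild a throwaway environment and install pinned
    # packages from requirements.txt before every run, so nothing stale
    # carries over between runs. Paths and script name are hypothetical.
    import shutil
    import subprocess
    import sys
    from pathlib import Path

    ENV_DIR = Path(".pipeline_env")

    def run_pipeline_fresh(script: str = "pipeline.py") -> None:
        # Delete any previous environment so every run starts clean.
        if ENV_DIR.exists():
            shutil.rmtree(ENV_DIR)
        subprocess.run([sys.executable, "-m", "venv", str(ENV_DIR)], check=True)
        python = ENV_DIR / ("Scripts" if sys.platform == "win32" else "bin") / "python"
        # requirements.txt holds pinned versions, e.g. pandas==2.2.2
        subprocess.run([str(python), "-m", "pip", "install", "-r", "requirements.txt"], check=True)
        subprocess.run([str(python), script], check=True)

    if __name__ == "__main__":
        run_pipeline_fresh()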





