> Just use utf8 right? But wait…
> What if the program reading the CSV use an encoding depending on the locale?
> A program can’t magically know what encoding a file is using. Some will use an encoding depending on the locale of the machine.
Excel (for Mac at least) is a fucking pain in this regard. Just try this minimal UTF-8 example:
tmpfile=$(mktemp /tmp/XXXXX.csv); echo '“,”' > $tmpfile; open -a 'Microsoft Excel' $tmpfile
Hooray, you successfully opened
“,”
I'm not even sure in which encoding e280 9c2c e280 9d corresponds to that (not the usual suspect cp1252, nor any code page in the cp1250 to cp1258 range; easy to confirm with iconv).
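The iconv check can also be scripted. A minimal sketch, assuming Python 3 and its built-in codec names; `decode_candidates` is a made-up helper, and including `mac_roman` in the list is only a guess at another candidate, not something established in this thread:

```python
# Bytes of the UTF-8 test line: U+201C, a comma, U+201D.
raw = bytes.fromhex("e2809c2ce2809d")

def decode_candidates(data, codecs):
    """Return {codec: decoded text, or None if the bytes are invalid in it}."""
    results = {}
    for name in codecs:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results

# cp1250 through cp1258, plus the two extra candidates.
candidates = ["utf-8"] + [f"cp125{i}" for i in range(9)] + ["mac_roman"]
for name, text in decode_candidates(raw, candidates).items():
    print(f"{name:10} {text!r}")
```

With Python's `mac_roman` codec those bytes come out as ‚Äú,‚Äù. Whether that is the fallback Excel actually uses is an assumption I have not verified.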
One remedy is to add a BOM (U+FEFF) to the beginning of the file, but of course no one other than Microsoft (at least in my experience) uses this weird UTF-8 with BOM encoding (which the Unicode standard recommends against), so it breaks other programs correctly decoding UTF-8.
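To make the BOM trade-off concrete, here is a small sketch using only the Python standard library. The helper name `csv_bytes` is hypothetical; `utf-8-sig` is Python's name for "UTF-8 with BOM":

```python
import csv
import io

def csv_bytes(rows, bom=True):
    """Serialize rows to CSV bytes, optionally prefixed with a UTF-8 BOM."""
    buf = io.StringIO(newline="")
    csv.writer(buf).writerows(rows)
    # "utf-8-sig" prepends EF BB BF on encode; plain "utf-8" does not.
    return buf.getvalue().encode("utf-8-sig" if bom else "utf-8")

data = csv_bytes([["“", "”"]])
print(data[:3].hex())  # the three BOM bytes

# A reader that decodes as plain UTF-8 keeps U+FEFF glued to the first field,
# which is the breakage described above; "utf-8-sig" strips it on decode.
print(repr(data.decode("utf-8")[0]))
print(repr(data.decode("utf-8-sig")[0]))
```

So the BOM only helps when the consumer knows to strip it, and anything that treats the file as plain UTF-8 sees a stray U+FEFF in the first cell.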
This means I can never share a non-ASCII CSV file with non-technical people. Always have to convert to .xlsx although it's usually easier for me to generate CSV. Then .xlsx opens me up to formatting problems, like phone numbers being treated as natural numbers and automatically displayed in scientific notation... Which means ssconvert or other naive conversion tools aren't enough, I need to use a library like xlsxwriter.
I'm not sure why it's so hard to just fucking ask when you don't know which encoding to use. (Plus it's not super hard to detect UTF-8. uchardet works just fine. Plus my locale is en_US.UTF-8, maybe that's a hint.)
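A rough sketch of the kind of check I mean, in Python. This is not what uchardet does (that is a statistical detector); it is just the cheap heuristic "look for a BOM, try strict UTF-8, otherwise fall back to the locale". `guess_encoding` and its `fallback` parameter are made-up names:

```python
import locale

def guess_encoding(data, fallback=None):
    """Guess an encoding: BOM first, then strict UTF-8, then a fallback."""
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8-sig"
    try:
        # Strict decoding rejects almost all non-UTF-8 text containing
        # bytes >= 0x80, so success here is a strong hint.
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Assumption: the locale's preferred encoding is the least bad guess.
        return fallback or locale.getpreferredencoding(False)

print(guess_encoding("“,”".encode("utf-8")))
print(guess_encoding(b"\x93,\x94", fallback="cp1252"))
```

Pure ASCII input is reported as UTF-8 as well, which is harmless since ASCII is a subset of it.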
Yes, it’s a pain. There is a way to open a UTF-8 CSV file correctly (Data > From Text and a few clicks), but it’s true that most people would never think to google something like that.
I know. But imagine sending a file with patronizing instructions on how to open it, which will still be ignored — the other end will double click on the file, see garbage, and get back to you.
The problem is that the person at the receiving end won't follow instructions for opening the file in Office: it's software they're moderately familiar with, but not familiar enough to navigate a special file-opening process, even with hand-holding.
I sincerely do not think the solution is to instruct the receiving person to install an even less familiar Office alternative.
Typically the biggest problem is saving from Excel to UTF-8 since it really really wants to just use its locale-dependent default charset. Opening is trouble too, but people are occasionally good about noticing that and figuring out how to get to the import options... the re-conversion to a different charset on save happens transparently so people don't notice.
In recent versions there finally seems to just be a new "UTF-8 CSV" option in the Save As dialog.