Hacker News | winety's comments


We used to have warthogs in Africa.

Definitely not cute.

https://en.wikipedia.org/wiki/Warthog


They have an unbelievably small braincase for a medium-sized quadrupedal mammal.


I'd say: spacing matters; f(1,2) is different from f(1, 2). Just use a semicolon to separate the numbers, e.g. f(1,2; 2).


Maths is typically written fast, so subtle spacing is too weak a signal to distinguish meanings. Where I studied maths, we even crossed our zeds horizontally to make them easier to distinguish from carelessly written 2s. Similarly, even though the country uses decimal commas, the uni uses decimal points, for the sake of clarity. Almost no one uses semicolons in function application in writing.


If everybody did that, we wouldn't have Zoozve...


It’s crazy, and that’s why hyphenation doesn’t really work that way. Both TeX and web browsers use Liang’s algorithm to split words. [1] It uses so-called patterns, which are short substrings of words in which inter-letter numbers mark division points: odd numbers allow a break, even numbers forbid one. For example, the pattern “s1h” indicates that in the word “fishing”, a divider can be inserted between “s” and “h”. Patterns compete and can override each other (the highest number wins), and the whole thing is quite complicated. As for your example with Qishan: the “s-h” pattern probably overrides the “i-s” one. (There have been a number of articles in TeX journals that explain the algorithm, such as [2].)
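The pattern mechanism is small enough to sketch in a few lines. This is a toy version with made-up patterns (real TeX pattern files contain thousands of entries, generated from word lists); it only shows how patterns match, compete, and mark break points:

```python
# A minimal sketch of Liang's pattern matching, with a tiny made-up
# pattern set (real pattern files have thousands of entries).
def hyphenate(word, patterns):
    w = "." + word.lower() + "."           # edge markers, as in TeX
    levels = [0] * (len(w) + 1)            # levels[k] = level in the gap before w[k]
    for pat in patterns:
        letters, digits = "", [0]
        for ch in pat:                     # "s1h" -> letters "sh", digits [0, 1, 0]
            if ch.isdigit():
                digits[-1] = int(ch)
            else:
                letters += ch
                digits.append(0)
        for i in range(len(w) - len(letters) + 1):
            if w[i:i + len(letters)] == letters:
                for j, d in enumerate(digits):
                    # patterns compete; the highest level wins
                    levels[i + j] = max(levels[i + j], d)
    # Odd final levels permit a break; even levels forbid one.
    out = ""
    for i, ch in enumerate(word):
        if i > 0 and levels[i + 1] % 2 == 1:   # gap before word[i] is gap i+1 in w
            out += "-"
        out += ch
    return out

print(hyphenate("fishing", ["s1h"]))           # -> fis-hing (the bad break)
print(hyphenate("fishing", ["h1i"]))           # -> fish-ing
print(hyphenate("fishing", ["s1h", "s2h"]))    # -> fishing (even level overrides odd)
```

The last call shows the overriding at work: the even-numbered pattern beats the odd one, so the bad break point is suppressed.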

In CSS, automatic hyphenation must be explicitly turned on, see [3].

In TeX and in HTML, hyphenation points can be marked explicitly: in TeX with the \- macro and in HTML with the &shy; entity (the soft hyphen, U+00AD). In TeX you can also override the automatic division with \hyphenation{}.

The splitting algorithm in CSS is worse than the one in TeX, because it has to work in real time and because (good) splitting patterns are often missing.

[1]: https://www.tug.org/docs/liang/

[2]: https://www.fi.muni.cz/usr/sojka/papers/euro01.pdf

[3]: https://developer.mozilla.org/en-US/docs/Web/CSS/hyphens


It seems very clear that Amazon's default approach is to insert hyphens based on a whitelist of correct hyphenation points.

And that is what the algorithm you refer to does! Your links [1] and [2] speak specifically in terms of the patterns being a form of data compression that is applied to lighten the storage requirements of a big list of correct hyphenation points. The hyphenation algorithm is just that you check the word you want to hyphenate against the Master List Of All Words and learn where hyphenation is allowed. The patterns are a form of data preprocessing that makes that algorithm more efficient (here, in terms of space requirements) without changing the output.

So what we need is a way to extend the set of precomputed rules whenever we want to use a word that wasn't in the original dictionary. As noted, TeX provides this with the \hyphenation{} command. Why is this not available in CSS?

Suppose I want to write an ebook that doesn't make mistakes on the level of "fis-hing" and "f-orest". [Another example I'm not making up; the Kindle app is convinced that "Ts-inghua" is correct hyphenation.] How do I include the hyphenation information in my document?
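One workaround I can imagine (a sketch with a made-up exception list; I'm not claiming any ebook format exposes a TeX-style \hyphenation{} list): preprocess the book's text yourself and bake U+00AD soft hyphens into the problem words, so the renderer's built-in patterns never get to guess:

```python
# Sketch: insert soft hyphens (U+00AD) at known-good break points.
# EXCEPTIONS is a made-up example dictionary, not a real word list.
import re

SHY = "\u00ad"  # soft hyphen; only rendered when the word is actually broken

EXCEPTIONS = {           # word -> indices where a break is allowed
    "Tsinghua": [5],     # Tsing-hua
    "fishing":  [4],     # fish-ing
    "forest":   [3],     # for-est
}

def mark_word(word):
    breaks = EXCEPTIONS.get(word)
    if not breaks:
        return word
    parts, prev = [], 0
    for b in breaks:
        parts.append(word[prev:b])
        prev = b
    parts.append(word[prev:])
    return SHY.join(parts)

def preprocess(text):
    # Naive: treats the input as plain text. A real version would have to
    # walk the HTML and skip tags and attributes.
    return re.sub(r"[A-Za-z]+", lambda m: mark_word(m.group()), text)

print(preprocess("fishing in the forest"))
```

Whether a given reader app honours U+00AD is another question, of course.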


> The splitting algorithm in CSS is worse than the one in TeX, because it has to work in real time and because (good) splitting patterns are often missing.

Surely that's only the case for real-time renderers like web browsers.

If you're creating a layout engine for printed media that uses CSS as the way for authors/setters to specify style, couldn't it implement a better, slower splitting algorithm? Using an internal (or pluggable?) dictionary of hyphenations?


You could, and that's basically what TeX does, just without the CSS. There are even typesetting systems similar to (La)TeX that can take XML as input, see ConTeXt [1] or SILE [2]. They’re just a step away from using HTML + CSS. Why isn’t there such a system? I do not know.

[1]: https://wiki.contextgarden.net/XML

[2]: https://sile-typesetter.org/


You could probably write some JS to reimplement the better algorithm and insert the &shy; (U+00AD) hyphenation hints.


> There is a problem because some licenses require attribution, but ignoring that...

Surely the solution would be to give credit to every author from the training corpus. I am looking forward to the 10 000 lines of copyrights in every header. :P

If Microsoft had trained it on its own code, there would be no such problems. Surely a company as large as Microsoft has produced enough code over the years to create a large enough training dataset.


> If Microsoft had trained it on its own code, there would be no such problems.

I keep seeing this sentiment from the GPL/"laundering" side of the debate.

Believe me, Microsoft wouldn't have released this thing (after what, 6 months of beta testing?) if they thought they had any "problems" at all.

I'm not saying I don't sort of agree with you, but is there no room for what's actually _likely_ to happen in this debate? Because as best as I can tell, they aren't going to see any real legal issues from this.

(There's also an option to remove generations that result in a collision with actual GitHub code, just fyi)

I feel like when the singularity happens, HN is going to be flooded with programmers mad that they got automated away, despite that being one of the primary goals of computer science and software engineering all along. This stuff is just a fact of life now.

Salesforce trained models (on GitHub) competitive with copilot without needing to own GitHub. I would spend less time worrying about how to lawyer up and more time figuring out how you're going to adapt to these new tools. That's the gig.


Microsoft made a bet that releasing Copilot will bring in more profit than the legal issues might cost them. That says nothing about whether there actually is a problem with it.

The simple way to test the legal theory behind Copilot would be to write an AI that writes music, trained on music scraped from YouTube or any other large music library. The idea that one can train on "publicly available material" and produce algorithms that output large chunks of copyrighted material is largely untested in court, but go after the wrong target and we will quickly see a response. We have actually seen some traces of this with news bots that scrape news sites and produce "novel" interpretations of existing news, especially sports news.


This is what I'm talking about. Are we commenting on a news report of someone actually doing what you're describing - filing a suit or taking legal action of any kind against MS for this?

No, we're not. Further, Amazon just announced a similar product and Salesforce has literally _released weights_ for their code models. You can't put the genie back in the bottle.

Actually enforcing any action when the representations are learned rather than hard-coded just seems impossible to me. They have a check box that removes any predictions matching existing code - that basically makes it impossible to discern the source since this will be based on some subjective "semantic closeness" BS.


> Believe me, Microsoft wouldn't have released this thing (after what, 6 months of beta testing?) if they thought they had any "problems" at all.

They would. Did you already forget Tay? Microsoft didn't consider that 4chan would train her to be the ultimate racist.


As you mentioned “WHEN the singularity happens” as an article of religious faith, followed by a vast leap of faith in proposing no-code tools taking over programming, I’m afraid that you adapting to the reality on the ground will be the difficult part here, rather than the lack of adaptation by programming writ large.

Do you work in marketing? Do you program?


I'm only somewhat certain that the singularity is inevitable (and obviously my predictions aren't worth betting on anyways) - sorry for using poetic language.

I'm a machine learning engineer, amateur researcher and open source contributor. Before that I was a software engineer for 8 years.


There are cursive typefaces for the Latin script, e.g. [1]. There might even be some free ones. Making (good looking) fonts is hard work, but I don’t think making cursive fonts is that much harder.

[1]: https://www.dizajndesign.sk/en/font/skolske


As for some solutions: the choice of writing instrument helps a lot. While fountain pens feel amazing to write with, one ends up looking like a smurf afterwards. Hard pencils are one of the better choices. Writing slower also helps a bit.

A “different” solution would be to write from right to left. I’ve tried it multiple times, both writing with mirrored symbols and writing non-mirrored symbols. The positive was that my hands were a lot cleaner. The negatives were that others couldn’t read what I wrote and that I looked like a crazy person.


Hah I took my notes in high school this way too! It came surprisingly naturally and wasn’t even that difficult for me to learn to read. Maybe I should just start taking notes backwards in business meetings, since nobody else reads them anyway.


> …then why aren't they just library functions in a scripting language?

I’d say they basically are. What’s the difference between Python’s sort and (let’s say) Bash’s sort? In Python I call sort to order a data structure like a list. In Bash I do the same, but on a different data structure. The crux of the matter is probably buried in semantics.
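To make the comparison concrete, here's the same little job done both ways, counting word frequencies and taking the top two. The pipeline version is shown as a comment since it is one line; the Python version is the library-function equivalent (the file name in the pipeline is just a placeholder):

```python
# As a shell pipeline (no files or data structures to manage explicitly):
#   tr -s ' ' '\n' < notes.txt | sort | uniq -c | sort -rn | head -2
#
# As library calls in Python:
from collections import Counter

text = "to be or not to be"
top = Counter(text.split()).most_common(2)
print(top)   # -> [('to', 2), ('be', 2)]
```

Both are "call sort on some data"; the difference is mostly the plumbing around the call.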

> I still don't see the beauty.

What I like about the shell and pipelines is that they let me easily edit and play with text. I don’t need to think about opening or closing files, and I don’t need to think about how best to represent the data I’m working with. In minutes I can get the information I need and move on.

