
I'm confused. What is this?


An automated parody of HN, OP explains above: https://news.ycombinator.com/item?id=10248803

> I've pulled every comment and story from HN through the API and made a bunch of Markov chains to produce story titles and comments.


Does that mean that there is no deep learning here, only statistics and randomization?


> Does that mean that there is no deep learning here, only statistics and randomization?

Correct. It is a perfect simulation of HN.


Normally I don't like comments that are just jokes, but this one was too perfect. Well done.

And to add just a slight bit more substance to my comment, while I was reading through all the comments here my wife asked why I was laughing so hard. I found it really difficult to convey why, but I guess that's the nature of this type of humor.


Markov chains do learn in the sense that they model distributions of strings, can be trained on observed strings and used to generate strings, assign probabilities to strings, classify strings, etc. They have well developed treatments in multiple frameworks of computational learning theory, including Gold learnability and PAC learnability.
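For anyone curious, the idea fits in a few lines. Here's a minimal word-level sketch in Python; the corpus, function names, and parameters are illustrative, not the OP's actual code:

```python
import random
from collections import defaultdict

def train(corpus):
    """Build a first-order Markov model: token -> list of observed successors."""
    model = defaultdict(list)
    for line in corpus:
        tokens = line.split()
        for cur, nxt in zip(tokens, tokens[1:]):
            model[cur].append(nxt)
    return model

def generate(model, start, max_len=20):
    """Random-walk the chain from a start token until a dead end or max_len."""
    out = [start]
    for _ in range(max_len - 1):
        successors = model.get(out[-1])
        if not successors:
            break
        out.append(random.choice(successors))
    return " ".join(out)

model = train(["show hn a markov chain", "show hn a parody of hn"])
print(generate(model, "show"))
```

Because duplicated successors stay in the list, sampling with random.choice already reproduces the observed transition frequencies without any explicit probability table.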

Is "deep learning" more than statistics and randomization?


Nothing wrong with statistics and randomization. :)


There isn't, but I don't think it should be difficult to feed this into 'char-rnn' if you wanted to do it with RNNs rather than Markov chains. The interface, such as it is, to char-rnn is very simple; you dump everything into a text file 'input.txt'.
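Assuming the comments are stored one JSON object per line with a "text" field (the format the samples downthread use), the dump step might look like this hypothetical helper; the file names are made up:

```python
import json
import html

def dump_for_char_rnn(jsonl_path, out_path):
    """Concatenate comment text into one plain-text file, as char-rnn expects."""
    with open(jsonl_path) as src, open(out_path, "w") as dst:
        for line in src:
            comment = json.loads(line)
            # char-rnn trains on raw characters, so just unescape HTML
            # entities and concatenate everything.
            dst.write(html.unescape(comment["text"]) + "\n")
```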


I'm actually running that right now with HN comments. It's not done training, but it's not that much more interesting than OP's. Here's some example output:

{"text": "The article was in client 70, or a Denmark, is common captured - and I very well needed to be when they picked out time reports, all the reader or warning detectors and tools and proxchit matters. It also comes up with all levels of me using legality as it is that.<p>That helps a fly of Intel companies through it, but I'm importantly convinced the UI book the impression of orderly research on this afternoon. Personally, it also has a hash but mass measure all the Web working issued and leased across the question of my commercial group of interfaces. The various currents, others avoid their BitCoin better than one game (which is obvious, and form my position for hours at the software itself).", "author": "nostrademons"}

{"text": "Neither care to censor other people (\"infrastructure\" type by development! Neuroscience, flying migrations.)<p>Relevant comment was not finished at the moment. If that appears to be the case that, but ones are a real body.", "author": "pavel_liah"}

{"text": "<i>But just surely this should simply escruble him critically though I laudf. </i><p>It's taken as a more extreme shark to manage hcpm-infolves. However, I'm great, laser mortality, one of the aight payment.", "author": "jacquesm"}

{"text": "FtAhn is not all for violenceral campuse. Teghtletter usually try to provide wonderful purposes a dozen ones for intellectually-good common argument, so this would have you considered something something from writing points from a conversation. It's such a good idea and prohibition, disappeared. Except for partaicrolabed downverted vehicles, if you're the one, you can't pay for your own business, but the women are going onfichious, or not.<p>Edit: the processor thinks without searching stuff. <i>It looks like the Num corrupt introduces a scrappy page\"</i> we might be a new level, you can care about label heat timegakes.<p>&#62; Two individuals. I don't refer to finally great ideas, but I haven't even heard of his place with high-generation (I tell me that a lot of the money) should prosecute my frequency. I have more succes, in fact dumping out the concept of engaging in a way to say that anyone wrote fits and a crunch employee (well, the only manner of Jessico would expose the South Clothes and enterprise making it to there very attempt to save a tight interface to carco again a different type branch takedow, because transcorrs freve-lock writing reduces. Grannian raises <i>the major</i> responsibility.)<p>(Nope! Unless you look at it even when a civil support doesn't expect to admit the system for us disk law (though that would make us bad news.)", "author": "sp332"}

{"text": "Sigh HTML5 of the extradition to NELOANAG may changed.", "author": "davidw"}

Or if I sample with low temperature:

{"text": "I don't know what I want to do in the sense that I was a problem with the same problem with the same problem as a comment on the side of the site. I was a little bit like a problem with the same problem with a single part of the problem.<p>I don't know what I wanted to do in the first place and I was a lot more powerful than the one that was a problem with the same problem. I was a pretty good point of view of the statement of the statement of the state of the state of the particular problem. I would have to say that the problem is that the problem is that the problem is that the problem is that the problem is that the problem is that the problem is that the problem is that the problem is that the problem is that the problem is that the problem is that the problem is that the problem is that the problem is that the problem is the same as a single person who would have to pay for the same problem as a program that is a problem that is a problem with the same problem. ... <then it just keeps going like that>

I'm hoping to learn individual users' styles.


> I'm hoping to learn individual users' styles.

You'll probably have to train separate char-rnn instances, unfortunately. For the past week or two, I've been experimenting with putting in a metadata prefix which I can use as a seed to specify author/style, but thus far it hasn't worked at all. The char-rnn just spits out a sort of average text and doesn't mimic specific styles.


Yup. That's been my finding as well. char-rnn was really just a diversion of curiosity after I'd cleaned up the data. My best idea right now is to make a generative model of p(next_token | previous_token(s), author), essentially connecting author directly to every observation. I'm mostly sure that using characters as tokens is overkill for this and requires a higher complexity model than I can afford with this dataset and my computational resources, so I'm going to stop using char-rnn with it.


That's possible. My hope was that you could get authorial style by just including it inline as metadata rather than needing to hardwire it into the architecture (e.g. take 5 hidden nodes and have them specify an ID for the author so the RNN can't possibly forget). It would have been so convenient and made char-rnn much more useful, but I guess it's turning out to be too convenient to be true.


I see. Cool. I'm building an HN clone now. Hierarchical comments are a fun challenge. Any tips?


Postgres recursive queries make them pretty simple to deal with. This[1] is the query I used to pull a list of all of a story's children recursively.

1. https://gist.github.com/orf/5565a572c6ddda039d6f
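For illustration only (the linked gist has the actual Postgres query), the same recursive-CTE idea can be sketched against SQLite, which also supports WITH RECURSIVE; the table and column names here are made up:

```python
import sqlite3

# Toy comments table: each row knows only its parent.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (id INTEGER PRIMARY KEY, parent_id INTEGER, text TEXT)")
conn.executemany("INSERT INTO comments VALUES (?, ?, ?)", [
    (1, None, "story"),
    (2, 1, "reply"),
    (3, 2, "reply to reply"),
    (4, 1, "another reply"),
])

# The CTE seeds the result with the story itself, then repeatedly joins
# in the children of everything found so far.
descendants = conn.execute("""
    WITH RECURSIVE tree AS (
        SELECT id, parent_id, text FROM comments WHERE id = ?
        UNION ALL
        SELECT c.id, c.parent_id, c.text
        FROM comments c JOIN tree t ON c.parent_id = t.id
    )
    SELECT id FROM tree
""", (1,)).fetchall()
print([row[0] for row in descendants])  # every comment under story 1
```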


Hmmm, I'm trying to decide between this recursive query or something using ltrees.

Can't make up my mind . . .

[1] https://truongtx.me/2014/02/28/tree-structure-query-with-pos... [2] http://stackoverflow.com/questions/603894/is-postgresqls-ltr...


Yeah. If you store them in a relational database you will grow a couple of grey hairs, because you're essentially forcing a tree structure into a table structure. The concepts clash and it's kind of a pain, but it's possible since every post has one unique parent, so you can make upward references and rebuild the tree from them.


A great technique I've used for many years to store hierarchical data is the Nested Set Model (https://en.wikipedia.org/wiki/Nested_set_model).

It works a treat as you can query the whole tree in one SQL statement but preserve the nesting for formatting.
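A toy sketch of the idea in Python, with hand-assigned (lft, rgt) bounds rather than real insert logic, just to show why subtree reads are a single pass:

```python
# Nested-set sketch: each node stores (lft, rgt) bounds assigned by a
# depth-first walk; a node's subtree is everything whose lft falls
# between its bounds. Values here are hand-assigned for illustration.
nodes = [
    {"id": "root", "lft": 1, "rgt": 8},
    {"id": "a",    "lft": 2, "rgt": 5},
    {"id": "a1",   "lft": 3, "rgt": 4},
    {"id": "b",    "lft": 6, "rgt": 7},
]

def subtree(nodes, parent_id):
    """One range scan, no recursion: the whole point of the nested set model."""
    parent = next(n for n in nodes if n["id"] == parent_id)
    return [n["id"] for n in nodes
            if parent["lft"] <= n["lft"] <= parent["rgt"]]

print(subtree(nodes, "a"))  # ['a', 'a1']
```

The flip side, as mentioned below, is that inserting a node means renumbering the bounds of everything to its right, which is why writes are the expensive operation in this model.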


Thank you sir for pointing that out. This looks very interesting and I might even implement it in my own blog at some point. I thought about doing something similar (without actually knowing this technique) but I shied away precisely because of the price you pay at insertion time.

EDIT: I've been thinking some more about this. Another possibility would be to limit the depth of the tree to, say, 8 (which should be reasonable) and then make 8 fields, one for each ancestor (parent, grandparent, and so on). Changing the tree will become a nightmare but all queries for subtrees will be blazingly fast.


Since individual comment threads are never that big, just store the root parent ID and query on that. Then reconstruct in code.
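A rough sketch of the reconstruct-in-code step, assuming rows of (id, parent_id) pairs fetched by root id (all names illustrative):

```python
from collections import defaultdict

def build_tree(rows):
    """rows: (id, parent_id) pairs for one thread.
    Returns a parent_id -> [child ids] map, which is enough to
    walk the thread top-down when rendering."""
    children = defaultdict(list)
    for comment_id, parent_id in rows:
        children[parent_id].append(comment_id)
    return children

# One indexed query on root id fetches the whole thread flat...
thread = [(1, None), (2, 1), (3, 1), (4, 2)]
# ...and one linear pass gives back the structure.
tree = build_tree(thread)
print(tree[1])  # direct replies to comment 1: [2, 3]
```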


Yeah, that's how I currently implemented it on my site, but it can't hurt to overthink solutions to performance problems I don't (yet?) have ;)


You could probably solve more nonexistent problems with caching than by limiting thread depth.


To clarify - more users are going to read and refresh pages than actually post, so making certain not every GET request results in a new database query would probably improve performance more than trying to limit the number of rows in each query.

Query performance obviously matters, but with a HN like site, it's probably not going to be so critical that limiting the depth of threads is even worth the effort.
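As a sketch of the caching idea, a tiny TTL cache in Python (the names and TTL are illustrative, and a real site would more likely reach for memcached or similar):

```python
import time

_cache = {}  # key -> (rendered page, timestamp)

def cached_page(key, ttl, render):
    """Serve a cached rendering if it's younger than ttl seconds,
    otherwise call render() (the expensive DB query + templating)."""
    now = time.monotonic()
    hit = _cache.get(key)
    if hit is not None and now - hit[1] < ttl:
        return hit[0]
    page = render()
    _cache[key] = (page, now)
    return page
```

With something like this in front of GET handlers, a busy front-page thread hits the database once per TTL window instead of once per reader.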


The "closure table" approach works best if you have frequent reads and writes, see e.g. https://coderwall.com/p/lixing/closure-tables-for-browsing-t...
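If it helps, the bookkeeping a closure table does can be sketched in plain Python; in practice the pairs live in a separate SQL table, and everything here is illustrative:

```python
# Closure-table sketch: alongside each comment, store every
# (ancestor, descendant, depth) pair, so subtree reads are one lookup.
closure = []  # rows of (ancestor, descendant, depth)

def insert(comment_id, parent_id):
    closure.append((comment_id, comment_id, 0))  # each node is its own ancestor
    if parent_id is not None:
        # copy all of the parent's ancestor rows, one level deeper
        closure.extend((anc, comment_id, depth + 1)
                       for anc, desc, depth in list(closure)
                       if desc == parent_id)

def subtree(root_id):
    return sorted(desc for anc, desc, _ in closure if anc == root_id)

insert(1, None); insert(2, 1); insert(3, 2)
print(subtree(1))  # [1, 2, 3]
```

Writes cost extra rows (one per ancestor), which is the trade the article describes: you pay on insert to make every subtree read a flat, index-friendly query.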


A simple method (a modified adjacency list) I've used just stores the root id, parent id and id of each post together. You can get the entire tree from any root post easily (everything shares the same root id) but getting the whole subtree beyond immediate children takes recursion.

I find that you don't even have to worry about treating the data as a tree in most cases until the very end. What you want to actually deal with is a flat array with the ids (root,parent,id) arranged in rendering order, and to have the tree built in the HTML. The data set from the DB doesn't even have to represent the tree structure directly, as long as you can sort it elsewhere.

You can even have two arrays - one (say, an associative array) with the data, and another basic array with the ids. Sort just the array with the ids, then use those as keys to iterate the data array when building the html, so you can avoid ever having to sort the larger array (which as luck has it just happens to be optimized for non-linear access anyway.)

I should probably mention, I thought this was terribly clever when I did it in PHP, before I was aware that all arrays in PHP are basically the same, so it was mostly pointless overoptimization.

Building something like an unordered list in HTML from that array then becomes a matter of adding or removing <UL> elements based on the relative change in depth for each subsequent id. Depth is easy to find by checking if an item's parent is (or isn't) the same as the id of the previous element. The actual tree never exists in code until the unordered list is rendered in the browser.
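The <UL> balancing described above might look roughly like this, assuming each row already carries its depth in rendering order (a simplification; strictly valid HTML would nest each inner <ul> inside an <li>):

```python
def render(rows):
    """rows: (id, depth) pairs, already sorted in rendering order.
    Emits <ul>/</ul> whenever depth changes between consecutive rows,
    so the tree only ever exists in the HTML."""
    html, prev_depth = ["<ul>"], 0
    for comment_id, depth in rows:
        if depth > prev_depth:
            html.append("<ul>" * (depth - prev_depth))
        elif depth < prev_depth:
            html.append("</ul>" * (prev_depth - depth))
        html.append(f"<li>{comment_id}</li>")
        prev_depth = depth
    html.append("</ul>" * (prev_depth + 1))  # close back down to the root
    return "".join(html)

print(render([(1, 0), (2, 1), (3, 1), (4, 0)]))
```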

Also here's a good reference on Stack Overflow of different methods to do the same thing: https://stackoverflow.com/questions/2175882/how-to-represent...

If you actually know what you're doing beforehand, probably ignore everything I just said and go with nested sets. My method is, admittedly, naive and better programmers are probably chuckling at it over the beverage of their choice, but it does work and it seems to be decently fast.


Hmmm, too many options. How do I decide between them? Sort of a DB noob.



