1. It doesn't count word frequencies, but sub-string frequencies. Moreover, if a sub-string appears more than once per title, it is counted more than once. I draw this conclusion from submitting "a,b,c". And from their paper [1]:
our algorithm strips out dashes and catches any
occurrence of the query in the title, for example,
'blow' catches 'blowing', 'blowjobs'
This explains the results of these queries: "ada,erlang", "tea,beer". As an alternative, they could have used a stemmer [2].
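To make the difference concrete, here is a minimal sketch in Python (the titles are made up, not from the dataset) contrasting naive substring counting, which is apparently what they do, with whole-word matching:

```python
import re

titles = ["Blowing in the Wind", "Blow by Blow", "Low Tide"]

def substring_hits(query, titles):
    # Counts every occurrence of the query inside each title,
    # so "Blow by Blow" contributes two hits for "blow",
    # and "Blowing" also counts as a hit.
    return sum(t.lower().count(query) for t in titles)

def word_hits(query, titles):
    # Counts only whole-word matches: "Blowing" no longer matches.
    pattern = r"\b%s\b" % re.escape(query)
    return sum(len(re.findall(pattern, t.lower())) for t in titles)

print(substring_hits("blow", titles))  # 3 (Blowing, Blow, Blow)
print(word_hits("blow", titles))       # 2 (only the exact word)
```

A stemmer would sit between the two: it would still catch "blowing" via the stem "blow", but not accidental substrings like "blow" inside an unrelated word.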
2. The "slow,fast" and "love,hardcore" queries illustrate an interesting shift, perhaps towards women or mainstream viewers.
In my first 2 weeks of working at an adult company (as a dev, yes, it's sad), one of my tasks was to watch/scan 200+ videos and describe them. It's true, you run out of inspiration fast.
Also, a hint: the "love" in the titles is probably explained by "love(s) to <insert profanity>".
I don't think I ever used hardcore in a title.
I used to work with a guy who once worked as a dev at an adult company. He said it was the most soul-sucking experience of his career, and that the owners knew absolutely nothing about technology and treated their tech staff terribly.
Based on the fact that you had to spend your first two weeks doing data entry, it sounds like his experience wasn't unique.
Next: provide the porn industry with a simple markov chain script to generate probabilistic porn movie titles, and save them all those incredibly tiresome brainstorm sessions they must have to create new titles :)
This reminds me of the first time I implemented a markov chain text generator. We were at a LAN party so we looked on the public network shares to find a corpus of text files to use as input.
The first thing we found was a copy of the bible, and the second thing we found was someone's collection of porn stories.
The start of the output was "He slipped his tongue into the lord..."
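For the curious, a bigram markov chain generator really is only a few lines of Python. A minimal sketch (the corpus lines are placeholders, not real titles):

```python
import random
from collections import defaultdict

# Placeholder corpus; a real generator would be trained on actual titles.
corpus = [
    "hot young teacher loves to dance",
    "young teacher in trouble",
    "hot girl next door loves to party",
]

# Bigram transition table: each word maps to the list of words
# observed immediately after it (duplicates keep the probabilities).
chain = defaultdict(list)
for line in corpus:
    words = line.split()
    for a, b in zip(words, words[1:]):
        chain[a].append(b)

def generate(start, max_words=8, seed=None):
    # Walk the chain from a start word, picking a random successor
    # each step, until we hit a dead end or the length limit.
    rng = random.Random(seed)
    out = [start]
    while len(out) < max_words and chain[out[-1]]:
        out.append(rng.choice(chain[out[-1]]))
    return " ".join(out)

print(generate("hot"))
```

With a corpus of a few thousand titles, even this naive bigram model produces convincingly title-shaped output (and, as the anecdote above shows, amusing results if you mix corpora).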
I think the interesting part about all this is how it changes over time. I have the impression that the whole area of sex is, and always has been, a weird reflection of society at large. I would be curious, for example, about a much longer-term graph comparing the frequency of the two words 'hardcore' and 'love'.
Very interesting to see the dataset being made available. Whenever I want to do this kind of analysis, I always stumble at 'how to get the data?'. In their paper, it is mentioned that "We created a dedicated computer program to carry out the navigation and data collection tasks required to gather the metadata for all available videos...". I would love to see this program. More broadly, can anyone point me to good resources (pref python) for learning to crawl/scrape this type of information?
Hi, I didn't release the code of the crawler: first, because it was not well-crafted enough to be released (quick and dirty linear programming), and second, because any change to the site you crawl calls for reworking your code.
I used python, sometimes with BeautifulSoup, sometimes with lxml; both are very good for this. I would say BeautifulSoup is easier, and lxml cleaner.
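To give a feel for the parsing side, here is a toy sketch using only the standard library's html.parser (BeautifulSoup's soup.select("a.title") would do the same in one line, but this version has no dependencies). The HTML snippet and the "title" class are hypothetical stand-ins for a downloaded listing page:

```python
from html.parser import HTMLParser

class TitleExtractor(HTMLParser):
    """Collects the text and href of <a class="title"> links,
    roughly what BeautifulSoup's select("a.title") would return."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == "title":
            self.in_title = True
            self.links.append(attrs.get("href"))

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.titles.append(data)

# Stand-in for a page fetched with urllib/requests in a real crawler.
html = """
<ul class="videos">
  <li><a href="/v/1" class="title">First Title</a></li>
  <li><a href="/v/2" class="title">Second Title</a></li>
</ul>
"""

parser = TitleExtractor()
parser.feed(html)
print(parser.titles)  # ['First Title', 'Second Title']
print(parser.links)   # ['/v/1', '/v/2']
```

The real work in a crawler is everything around this: fetching pages politely (rate limiting, retries) and following pagination, which is exactly the part that breaks whenever the site changes.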