Hacker News | gwu78's comments


"The task is to sum the values for each key and print the key with the largest sum."

What is the smart way to do this in kdb+?

This is my naive, sloppy 15-minute approach.

Warning: Noob. May offend experienced k programmers.

   k)`t insert+:`k`v!("CI";"\t")0:`:tsvfile
   k)f:{select (*:k),(sum v) from t where k=x}
   k)a:f["A"]
   k)b:f["B"]
   k)c:f["C"]
   k)select k from a,b,c where v=(max v)


Using the file from the original,

    1#desc sum each group (!/) (" II";"\t") 0: `:tsvfile
Took about 3 seconds, 2.5 of which was spent reading the file.

EDIT:

    q)\ts d: (!/) (" II";"\t") 0: `:tsvfile
    2489 134218576
    q)\ts 1#desc sum each group d
    486 253055104


I was using the first example with a char in the first column.

   A 4
   B 5
   B 8
   C 9
   A 6
How to solve with only a dict?
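Not k, but for a sanity check on the numbers: a plain-Python sketch of the same sum-and-argmax task on the five rows above (names are mine, not from the thread):

```python
from collections import defaultdict

# Sum the values for each key, then report the key with the largest sum.
rows = [("A", 4), ("B", 5), ("B", 8), ("C", 9), ("A", 6)]

sums = defaultdict(int)
for k, v in rows:
    sums[k] += v

best = max(sums, key=sums.get)
print(best, sums[best])  # B 13
```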

Regarding the 1gram file at https://storage.googleapis.com/books/ngrams/books/googlebook...

This is the result I got:

   3| 1742563279
using

   q)\ts d:(!/)(" II";"\t")0:`:1gram
   q)\ts 1#desc sum each group d
   1897 134218176
   371 238872864
or

   k)\ts d:(!/)(" II";"\t")0:`:1gram
   k)\ts desc:{$[99h=@x;(!x)[i]!r i:>r:. x;0h>@x;'`rank;x@>x]}
   k)\ts 1#desc (sum'=:d)
   1897 134218176
   0 3152
   372 238872864
No doubt I must be doing some things wrong.


I actually had it wrong in mine. Wasn't paying attention and had the dictionary the wrong way around. Probably would have been more obvious with the char since you can't sum them...

With the reverse thrown in to switch the key/value around, we get the correct answer

    q) 1#desc sum each group (!/) reverse (" II";"\t")0:`:1gram
    2006| 22569013
or

    k) {(&x=|/x)#x}@+/'=:!/|(" II";"\t")0:`:1gram
    (,2006i)!,22569013i
Works the same for the simple example

    k)e: 4 5 8 9 6!"ABBCA"
    k){(&x=|/x)#x}@+/'=:e
    (,"B")!,13


Long before amp, Google began prefixing search result urls with "google.tld?url=" and adding Google parameters as suffixes such as "sa=", "ved=", etc.

Unless I am mistaken this parasitic cruft only serves Google, not end users.

Below is a quick and dirty program to filter out the above. Replace .com with .cctld as needed.

Requirements: cc, lex

Usage:

   curl -o 1.htm "https://www.google.com/search?q=xyz"
   yyg < 1.htm > 2.htm
   your-ad-supported-web-browser 2.htm
To compile this I use something like

   flex -Crfa -8 -i g.l;
   cc -Wall -pipe lex.yy.c -static -o yyg;
Save the text below as file g.l, then compile as above.

   %%
   [^\12\40-\176]
   \/url[?]q= 
   "http://www.google.com/gwt\/x?hl=en&amp;u=" 
   "&amp;"[^\"]* 
   %%
   main(){yylex();}
   yywrap(){}
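If flex is not handy, the same filtering can be approximated in Python. This is only a sketch mirroring the lex rules above (char-class strip, /url?q= removal, gwt prefix removal, &amp;... suffix removal), not a drop-in replacement:

```python
import re

def clean(html: str) -> str:
    # strip bytes outside newline + printable ASCII (mirrors [^\12\40-\176])
    html = re.sub(r"[^\n\x20-\x7e]", "", html)
    # drop Google's redirect prefix on result links (the /url?q= rule)
    html = html.replace("/url?q=", "")
    html = html.replace("http://www.google.com/gwt/x?hl=en&amp;u=", "")
    # drop trailing Google parameters: "&amp;..." up to the closing quote
    html = re.sub(r"&amp;[^\"]*", "", html)
    return html

print(clean('<a href="/url?q=http://example.com/&amp;sa=X&amp;ved=0">'))
# -> <a href="http://example.com/">
```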
As for amp, I read that it needs to use iframes (and Javascript). Yikes. We can easily write a program to strip out iframe targets as well as links to Javascript.

amphtml does look great in a text-only browser that does not load iframes automatically.


It's really annoying trying to copy and paste URLs from Google results. It also seems largely unnecessary; can't they detect clicks using Javascript? I have noticed they have started doing this with links sent through Google Hangouts messages as well. I do remember a time when they weren't doing this, and it was very refreshing because everyone else was.


Favorite part is how he mixes the register allocator with a chain of seds.

I did something similar with a youtube downloader I wrote, using a long chain of seds. It is not as beautiful as Python but it is smaller and faster.

When I shared it with HN, the youtube-dl author called it "unmaintainable". By whom? I have had no problems maintaining it. :)


This is my "Hacker News Reader". It converts HN to csv. (Only selected fields of interest to me.) From there it can easily be imported into kdb+. I have more reusable generalized lex techniques for other websites but HN is so simple it can be done via a braindead one-off as below.

Requirements: lex, cc

Usage:

   fetch -4o yc.htm https://news.ycombinator.com
   yc < yc.htm 
To compile this I use something like

    flex -Crfa -8 -i yc.l;
    cc -Wall -pipe lex.yy.c -static -o yc;
Save the text below as yc.l then compile as above.

    #define jmp BEGIN
    #define p printf
    #define x yytext
   %s aa bb cc dd ee ff gg hh
   %s ii jj kk ll mm nn oo 
   aa "span class=\"rank\""
   bb "a href=\""
   cc score_........
   dd \>
    /* #include <time.h> */
    /* #include <util.h> */
   %%
   [^\12\40-\176]
   , p("%2c");
    /* rank (dont care) */
   {aa} jmp aa;
    /* <aa>[1-9][^<\.]* p("\n%s,",x);jmp bb; */
   <aa>[1-9][^<\.]* p("\n");jmp bb; 
    /* url */
   <bb>{bb} jmp cc;
   <cc>http[^"]* p("%s,",x);jmp dd; 
    /* title */
   <dd>{dd} jmp ee;
    /* <ee>[^><]* p("%s,",x);jmp ff; */
   <ee>[^><]* p("%s",x);jmp ff;
    /* host (omit) */
    /* points (dont care) */
   <ff>{cc} jmp gg;
   <gg>{dd} jmp hh;
    /* <hh>[1-9][^<> p]* p("%s,",x);jmp ii; */
   <hh>[1-9][^<> p]* p(",");jmp ii; 
    /* user */
   <ii>{bb} jmp jj;
   <jj>http[^"]* p("%s,",x);jmp kk;
    /* time (dont care) */
   <kk>{bb} jmp ll;
    /* <ll>http[^"]* ; */
   <ll>http[^"]* jmp mm;
    /* unix time (dont care) */
    /* <ll>[1-9][^<]* { 
    time_t t0; time_t t1; time_t t2;
    t1=time(&t0);
    t2=parsedate(x,&t1,0);
    p("%d,",t2); 
    jmp mm;
    } */
    /* item */
   <mm>{bb} jmp nn;
    /* <nn>http[^"]* p("%s,",x);jmp oo; */
   <nn>http[^"]* p("%s",x);jmp oo;
    /* comments (dont care) */
   <oo>{dd} jmp oo;
    /* <oo>[1-9d][^ <]* p("%s",x);jmp 0; */
   <oo>[1-9d][^ <]* jmp 0;
   .
   \n
   %%
   main(){ yylex();}
   yywrap(){ p("\n");}
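For the curious, the same scrape-to-CSV idea can be sketched with Python's stdlib html.parser instead of lex. The "titleline" class name here is my assumption about the markup, not taken from the lex rules above, and may need adjusting:

```python
from html.parser import HTMLParser

class Stories(HTMLParser):
    # Collect (url, title) pairs from the first anchor following an
    # element with class "titleline" (an assumption about the markup).
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.url = None
        self.rows = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if a.get("class") == "titleline":
            self.in_title = True
        elif tag == "a" and self.in_title and self.url is None:
            self.url = a.get("href", "")

    def handle_data(self, data):
        if self.url is not None:
            # escape commas as %2c, like the lex rule `, p("%2c");`
            self.rows.append((self.url, data.replace(",", "%2c")))
            self.url = None
            self.in_title = False

p = Stories()
p.feed('<span class="titleline"><a href="http://example.com">Hello, world</a></span>')
for url, title in p.rows:
    print(f"{url},{title}")
```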



And I thought using regex for parsing HTML was bad.


"I suggest compiler writers stop viewing the x86 stack as a twisted version of a traditional 8-register set."

What if compiler writers won't fix the problem?

Write your own register allocator: http://cr.yp.to/qhasm.html

Bonus: allow use of a "portable assembly language" to generate assembly, being respectful of the fact that users might not all be using computers with the same CPU architecture.

This could be like assembler with C-like operators and structure.

In addition to facilitating portability, it might also make writing in assembler a little easier.

Reminds me of Bell Labs' LIL: http://www.ultimate.com/phil/lil/tut.html


Notable that he calls the "kill-switch" a "mistake". Chrome, for example, does the same thing: when it starts, it checks for some presumably non-existent domain name.


Yes, but the key difference is that chrome uses a randomly generated domain name, while the ransomware has it hardcoded.


Yes, this sounds right. It has been a while since I looked at it. Is it just one name? I have a faint recollection it tried more than one.

Anyway, how is the difference significant?

A localhost cache can point at a custom root.zone. The user can make her own authoritative nameserver assignments for any given zone or domain. Zone files can contain wildcards.

Responses can also be rewritten on the fly.

The end user can exercise full control over what is and is not a "valid" domain name. She can prevent her applications from ever receiving an "NXDOMAIN" response.

Maybe I am missing something but this "test" seems brittle; it only tests ICANN DNS.
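The brittleness is easy to see in miniature. The "test" is just a resolution attempt (my sketch of the logic, not the actual malware code), so any resolver that answers for arbitrary names, e.g. a wildcard in a custom root.zone, flips the result:

```python
import socket

def killswitch_tripped(domain: str) -> bool:
    # The kill-switch logic, roughly: if the name resolves, stop running.
    # A local resolver that never returns NXDOMAIN defeats this check,
    # since every probe name then "exists".
    try:
        socket.gethostbyname(domain)
        return True
    except socket.gaierror:
        return False

print(killswitch_tripped("localhost"))  # True: any resolvable name trips it
```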


Is this a "JSON Feed" from NYTimes?

Example below filters out all URLs for a specific section of the paper.

   test $# = 1 ||exec echo usage: $0 section

   curl -o 1.json https://static01.nyt.com/services/json/sectionfronts/$1/index.jsonp
   exec sed '/\"guid\" :/!d;s/\",//;s/.*\"//' 1.json
I guess SpiderBytes could be used for older articles?

Personally, I think a protocol like netstrings/bencode is better than JSON because it better respects the memory resources of the user's computer.

Every proposed protocol will have tradeoffs.

To me, RAM is sacred. I can "parse" netstrings in one pass, but I have been unable to do this with a state machine for JSON. I have to arbitrarily limit the number of states or risk a crash. Just as it is easy to exhaust a user's available RAM with Javascript, so too can it be done with JSON. Indeed, they go well together.
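To make the one-pass point concrete, here is a minimal netstring reader sketch in Python, with an explicit length cap, since bounding memory up front is the whole point:

```python
def read_netstring(buf: bytes, max_len: int = 1 << 20):
    # Netstring format: <decimal length>":"<payload>","
    # One pass, no recursion, and the length is known before any
    # payload memory is touched -- unlike JSON's unbounded nesting.
    colon = buf.index(b":")
    n = int(buf[:colon])
    if n > max_len:
        raise ValueError("netstring too long")
    payload = buf[colon + 1 : colon + 1 + n]
    if buf[colon + 1 + n : colon + 2 + n] != b",":
        raise ValueError("missing trailing comma")
    return payload, buf[colon + 2 + n :]

payload, rest = read_netstring(b"5:hello,3:foo,")
print(payload, rest)  # b'hello' b'3:foo,'
```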


"Lessons learnt by ransomware developers..."

If you are suggesting that developers, regardless of whether they develop mobile apps or ransomware, will start relying less on DNS, I respectfully disagree.

Someone else in this thread commented how reliance on DNS makes systems "fragile". With that I strongly agree.

The same old assumptions will continue to be made, such as the one that DNS, specifically ICANN DNS, is always going to be used.

How to break unwanted software? Do not follow the assumptions.

For example, to break a very large quantity of shellcode change the name or location of the shell to something other than "/bin/sh".[1]

Will shellcoders switch to a "robust statistical model" instead of hard coding "/bin/sh"?

Someone once said that programmers are lazy. Was he joking?

1. Yes, I know it may also break wanted third-party software. When I first edited init.c and renamed and moved sh, I was seeking to learn about dependencies. I expected things to break. That was the point: an experiment. I wanted to see what would break and what would not.


If you change the name or location of the shell to something other than "/bin/sh", plenty of legitimate software would break too.

Even though the POSIX standard says:

> Applications should note that the standard PATH to the shell cannot be assumed to be either /bin/sh or /usr/bin/sh, and should be determined by interrogation of the PATH returned by `getconf PATH`, ensuring that the returned pathname is an absolute pathname and not a shell built-in.

> For example, to determine the location of the standard sh utility:

   command -v sh


Here is a spitbol/snobol4 solution. It assumes the number of items in the set is not greater than the size of the alphabet.

   * routine stolen from gimpel
   * algorithm from peck and schrack 

    define('p(s)t,n,c,k','p_init') :(p_end)
   p_init n = size(s)
    r = array('2:' n, 0)
    &alphabet len(n) . y
    x = array('2:' n, y)
    k = n + 1
   p_0 k = k - 1
    x[k] len(1) . s1 tab(k) . s2 = s2 s1  :s(p_0) 
    define('p(s)i,k') 
    p = s :(return)
   p k = size(s)
   p_1 s = replace(x[k],y,s) :f(p_2)
    r[k] = r[k] + 1
    k = eq(remdr(r[k], k),0) k - 1 :s(p_1)
    p = s :(return)
   p_2 define('p(s)t,n,s1,s2','p_init') :(freturn)
   p_end

   * example: all permutations of string abcdefgh
    s = 'abcdefgh'
   abc output = p(s) :s(abc)f(end)
   end
Here is another solution that only returns the unique permutations. The items of the set must first be sorted or grouped; e.g., a string like "cabcd" could be given as "ccabd", "adbcc", "abccd", etc. Duplicate items must be adjacent.

    define('r(s,ors)c,f,s1,a,d,os') 
    :(r_end)
   r ors rtab(1) len(1) . c :f(freturn)
    s (span(c) | null) . f =
    s arb . s1 len(1) . d c = :f(r_1)
    r = s1 f c d s :(return)
   r_1 ors break(c) . os
    r = r(s,os) f :s(return)f(freturn)
   r_end

    s = 'abcdefgh' 
    output = s
   x01 output = r(output,s) :s(x01)
   end
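For comparison, the same unique-permutations behavior sketched in Python; unlike the snobol version, it does not require duplicate items to be adjacent:

```python
def unique_perms(s):
    # Distinct permutations of s; duplicates are pruned by never
    # choosing the same character twice at the same position.
    if len(s) <= 1:
        yield s
        return
    seen = set()
    for i, c in enumerate(s):
        if c in seen:
            continue
        seen.add(c)
        for rest in unique_perms(s[:i] + s[i + 1:]):
            yield c + rest

print(sorted(unique_perms("aab")))  # ['aab', 'aba', 'baa']
```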

