
At a glance it seems correct, but there are a lot of inefficiencies, which may or may not be acceptable depending on the interview level/role.

Major:

1. Sorting finishedPaths is unnecessary given it only asks for the most frequent one (not the top 3 btw)

2. Deleting from the middle of the unfinishedPaths list is slow because it needs to shift the subsequent elements

3. You're storing effectively the same information 3 times in unfinishedPaths ([A, B, C], [B, C], [C])

Minor:

1. line.split is called twice

2. Way too many repeated dict lookups that could be easily avoided (in particular the 'if key (not) in dict: do_something(dict[key])' stuff should be done using dict.get and dict.setdefault instead)

3. deleteIndex doesn't need to be a list, it's always at most 1 element
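To make minor point #2 concrete, here is a minimal sketch of collapsing the repeated-lookup pattern with dict.get and dict.setdefault (the names `counts`, `paths`, `key` are hypothetical, not from the reviewed code):

```python
counts = {}
key = ("A", "B", "C")

# Repeated-lookup style: tests membership, then looks the key up again.
if key in counts:
    counts[key] = counts[key] + 1
else:
    counts[key] = 1

# Single-expression equivalent for counting:
counts[key] = counts.get(key, 0) + 1

# setdefault for mutable defaults: creates the list only on first use.
paths = {}
paths.setdefault("user1", []).append("/home")
```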



> there's a lot of inefficiencies, which might or might not be acceptable

This is exactly what irritates us about these questions. There's no possible answer that will ever be correct "enough".


Just like in real life, there's no perfect solution to most problems, only different trade-offs.


Thanks for the feedback!

I realized at least the double-calling of line.split while writing the second instance, but figured I'm in an interview (not a take-home where you polish it before handing in), and that this is more about getting a working solution fairly quickly, since there are more questions and topics and most interviews are 1h. From there the interviewer can steer towards the issues they care about. Then again, I've never had to do live coding in an interview, so perhaps I'm wrong, or overthinking what would take a handful of seconds to improve.

That only one user path can ever hit length==3 at a time is an insight I hadn't had. That's from minor point #3, but I guess it also shows up in major points #2 and #3, because it means you can design the whole thing differently: each user having a rolling buffer of 3 elements and a pointer, perhaps. (I guess this is the sort of conversation to have with the interviewer.)
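The rolling-buffer idea could look something like this, using collections.deque with maxlen so no explicit pointer is needed (a hypothetical sketch, not the reviewed code; `visit`, `windows`, and `path_counts` are made-up names):

```python
from collections import defaultdict, deque

# Each user keeps a rolling window of their last 3 pages.
windows = defaultdict(lambda: deque(maxlen=3))
path_counts = defaultdict(int)

def visit(user, page):
    w = windows[user]
    w.append(page)  # deque drops the oldest entry automatically at maxlen
    if len(w) == 3:
        path_counts[tuple(w)] += 1  # count the current 3-page path

for page in ["/", "/a", "/b", "/c"]:
    visit("u1", page)
# counted: ('/', '/a', '/b') and ('/a', '/b', '/c')
```

This avoids both the O(n) middle-of-list deletion and storing the same path suffixes three times.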

defaultdict: yeah, I know of it, but I don't remember the API by heart, so I don't use it. Not sure the advantage is worth it, but yes, it would look cleaner.

Got curious about the performance now. After downloading 1M lines of my web server logs and formatting them so that IPaddr=user and URI=page (65MB total), the code runs in 3.1 seconds. I'm not displeased with 322k lines/sec for a quick/naive solution in CPython, I must say. One might argue that for an average webshop, more engineering time would just be wasted :) but of course a better solution would be better.

Finally, I was going to ask what you meant by major point #1, since the task does say top 3, but then I read it one more time and... right. I should have seen that!

As for that major point though, would you rather see a solution that does not scale to N results? Like, now it can give the top 3 paths but also the top N, whereas a faster solution that keeps a separate variable for the top entry cannot do that (or it needs to keep a list, but then there's more complexity and more O(n) operations). I'm not sure I agree that sorting is not a valid trade-off given the information at hand, that is, not having specified it needs to work realtime on a billion rows, for example. (Checking just now to quantify the time it takes: sorting is about 5% of the time on this 1M lines data sample.)

For anyone curious, the top results from my access logs are

   / -> / -> / with a count of 6120
   /robots.txt -> /robots.txt -> /robots.txt with a count of 4459
   / -> /404.html -> / with a count of 4300


> As for that major point though, would you rather see a solution that does not scale to N results? [...] I'm not sure I agree that sorting is not a valid trade-off given the information at hand.

You need the list regardless, just do `max` instead of `sort` at the end, which is O(N) rather than O(N log N). Likewise, returning top 3 elements can still be done in O(N) without sorting (with heapq.nlargest or similar), although I agree that you probably shouldn't expect most interviewees to know about this.
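Sketched out, with a hypothetical counts dict (the values borrowed from the log numbers above):

```python
import heapq

# Hypothetical counts: 3-page path -> frequency
path_counts = {
    ("/", "/", "/"): 6120,
    ("/robots.txt", "/robots.txt", "/robots.txt"): 4459,
    ("/", "/404.html", "/"): 4300,
    ("/a", "/b", "/c"): 17,
}

# Single most frequent path: O(N), no sort needed.
top1 = max(path_counts.items(), key=lambda kv: kv[1])

# Top 3 without a full sort: heapq.nlargest runs in O(N log k),
# effectively O(N) for fixed k=3.
top3 = heapq.nlargest(3, path_counts.items(), key=lambda kv: kv[1])
```

Both still scale to top N (just change the first argument to nlargest), so nothing is lost versus sorting.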

As for the rest, as I've said, it depends on the candidate level. From a junior it's fine as-is, although I'd still want them to be able to fix at least some of those issues once I point them out. I'd expect a senior to be able to write a cleaner solution on their own, or at most with minimal prompting (eg "Can you optimize this?")

FYI, defaultdict and setdefault are not the same thing.

  d = defaultdict(list)
  d[key].append(value)
vs

  d = {}
  d.setdefault(key, []).append(value)
The latter is useful when you only want the "default" behavior in one piece of code but not others.

  >   / -> / -> / with a count of 6120
  >   /robots.txt -> /robots.txt -> /robots.txt with a count of 4459
LOL



