The interview process took about two weeks and on the whole was pretty reasonable for both sides.
There was a short phone screen (~30 minutes).
Then I had two technical exercises to do. For each, I was given a reasonable time window (4 hours) that started when I visited a special link to get the problem description; I could then use whatever tools I wanted to get it done, and emailed in the completed exercise.
The two exercises were not completely trivial, but they were relatively straightforward and involved skills actually relevant to the position, not arcane puzzles or comp-sci theory. (Basically, the first was to write a Postgres schema from scratch: 5 or 6 tables and maybe 30 columns across all the tables. The second was to write a CSV parser, including reasonable error checking/recovery, to load data into the schema from the first problem.)
Finally, there was the main interview (conducted via FaceTime; this is a remote position). That ran for close to 3 hours.
This worked well, since it gave me a feel for what the work would actually be like, and as the candidate I was given every opportunity to put my best foot forward - no whiteboard exercises, no surprises.
Someone has to write that CSV parser though, don't they?
And sometimes you actually need to write things like sort algorithms. I implemented an insertion sort from scratch recently, because it needed to be done in a particular way to take advantage of the capabilities of a specific JIT compiler.
People do actually do this stuff - it's not all academic.
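For the record, the textbook form is only a few lines; the JIT-friendly variant mentioned above would differ in details that aren't given here. A plain Java sketch:

```java
import java.util.Arrays;

public class InsertionSortSketch {
    // Plain insertion sort: shift larger elements right one slot at a time,
    // then drop the key into the gap. O(n^2) worst case, but fast on small
    // or nearly-sorted inputs.
    static void insertionSort(int[] a) {
        for (int i = 1; i < a.length; i++) {
            int key = a[i];
            int j = i - 1;
            while (j >= 0 && a[j] > key) {
                a[j + 1] = a[j];
                j--;
            }
            a[j + 1] = key;
        }
    }

    public static void main(String[] args) {
        int[] a = {5, 2, 9, 1};
        insertionSort(a);
        System.out.println(Arrays.toString(a)); // [1, 2, 5, 9]
    }
}
```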
Did you have to do this on a whiteboard or paper as part of the interview process for your job? The point: not being able to do this in an interview, under completely contrived circumstances and with all the anxiety that comes from close scrutiny by a person who has the power to determine your future career, is not a good gauge of your ability to do it when it matters, given all the resources we as developers generally afford ourselves.
My biggest problem with the vast majority of these tests is that they only test two skills, neither of which is all that critical to software development beyond a certain minimum threshold: recollection, and pattern-matching known solutions to familiar problems. My view, after a decade of tenure in software development, is this: if you're asking me to solve problems which I could easily "solve" with a few minutes of searching the web, you probably don't want or need someone like me, who has spent the majority of their career on big projects requiring a multitude of disciplines: being creative, quantifying results, anticipating future use, and systematically testing and releasing with a minimum of risk.
What I find ironic is that I've never been tested on some of the few rote tasks I find most developers struggle with: committing/branching/merging/commenting code, producing post-release documentation, developing robust API functions, etc.
Scenario: The file is a 100GB CSV file. The machine is a VPS with 500MB of RAM. The task: determine whether the value in the first column of the first row is ever repeated in the first column of any subsequent row.
An import solution is unlikely to work. Most import solutions would try to load the rows into memory, which doesn't work here. Maybe if the imported parser can be configured to run a user-supplied callback on individual rows and then discard them...
(P.S. solving the problem with awk still counts as writing a CSV parser --- in awk)
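That callback style is easy to hand-roll, too. Here's a sketch in Java; `forEachRow` is my own name, not any particular library's API, and the naive comma split deliberately ignores quoting:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.function.Consumer;

public class StreamingRows {
    // Hands each row to the callback and then drops it, so memory use is
    // bounded by the longest single line, not the file size.
    // (Naive field splitting: quoted fields are not handled.)
    static void forEachRow(String filename, Consumer<String[]> handler) throws IOException {
        try (BufferedReader r = new BufferedReader(new FileReader(filename))) {
            String line;
            while ((line = r.readLine()) != null) {
                handler.accept(line.split(","));
            }
        }
    }
}
```

The caller supplies whatever per-row logic it needs (comparison, counting, insertion into a database) without the parser ever holding more than one row.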
It's not even a library thing. Reading a file this way does its reads in approximately optimal (for the filesystem) block sizes. The CSV parser just rides on top of stdio.
    BufferedReader r = new BufferedReader(new FileReader(filename));
    String line = r.readLine();
    if (line != null) {
        String first = line.split(",")[0];
        while ((line = r.readLine()) != null) {
            if (first.equals(line.split(",")[0])) {
                return true;
            }
        }
        return false;
    }
    throw new RuntimeException();
would work just fine, and very quickly, too. This sort of question simply isn't hard to solve; success depends far more on tiny implementation details than on any deep insight. I very much doubt that any CSV parser would try to load the file all at once; it'd be too much of a performance hit.
In other words, you write your own CSV parser, in this case using Java.
Btw, CSV files can contain values with commas and even newlines inside of them. So if your point was that you don't have to write an _entire_ CSV parser, only a partial one, unfortunately that isn't true. https://en.wikipedia.org/wiki/Comma-separated_values#Example
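For what it's worth, a quote-aware version of the first-column check doesn't need the whole file in memory either. Here's a sketch (names are mine; it assumes RFC 4180-style quoting, where quoted fields may contain commas, newlines, and doubled quotes, and it compares raw field text without trimming):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;

public class FirstColumnCheck {
    // Scans one CSV record from the stream and returns its first field, or
    // null at end of input. Everything after the first field is read and
    // discarded, so memory use stays constant regardless of file size.
    static String nextFirstField(Reader r) throws IOException {
        int c = r.read();
        if (c == -1) return null;                  // no more records
        StringBuilder first = new StringBuilder();
        boolean inQuotes = false;
        boolean capturing = true;                  // still inside the first field?
        while (c != -1) {
            char ch = (char) c;
            if (inQuotes) {
                if (ch == '"') {
                    int peek = r.read();
                    if (peek == '"') {             // "" escapes a literal quote
                        if (capturing) first.append('"');
                        c = r.read();
                        continue;
                    }
                    inQuotes = false;
                    c = peek;                      // reprocess the peeked char
                    continue;
                }
                if (capturing) first.append(ch);   // quoted CR/LF/commas are data
            } else if (ch == '"') {
                inQuotes = true;
            } else if (ch == ',') {
                capturing = false;                 // skip the rest of the record
            } else if (ch == '\n') {
                break;                             // unquoted newline ends the record
            } else if (ch != '\r') {
                if (capturing) first.append(ch);
            }
            c = r.read();
        }
        return first.toString();
    }

    static boolean firstValueRepeats(String filename) throws IOException {
        try (BufferedReader r = new BufferedReader(new FileReader(filename))) {
            String target = nextFirstField(r);
            if (target == null) throw new IOException("empty file");
            String field;
            while ((field = nextFirstField(r)) != null) {
                if (target.equals(field)) return true;
            }
            return false;
        }
    }
}
```

Only the current first field is ever held, so this runs in constant memory even on the 100GB/500MB scenario, and it handles the embedded-newline records that break a readLine() loop.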
>I very much doubt that any CSV parser would try to load the file all at once
I agree, but a naive imported solution would likely try to store all the rows at once, possibly in some sort of list or vector or array or whatever. This is what would cause the memory failure, not the file read itself.
Possibly, but unlikely (most CSV libraries are built around iterators). In any case, the problem is stupidly underspecified. For example, what if the 100GB CSV is 100 rows of 1GB each? What if the input is UTF-16 or UTF-32? How do you deal with the 10 Unicode line separators?
It just tests how well the interviewee knows CSV, which is an ill-specified format anyway. It's a fake problem: no sane person would parse 100GB CSVs on a 500MB VPS, and in real life you'd just try the naive solution, see why it didn't work, and iterate.
This is a valid record definition from a CSV file (where CRLF marks a literal line break):
"b CRLF
bb","ccc" CRLF
zzz,yyy,xxx
Your code will fail on this case. CSV parsing is not as simple as it sounds, and given that the format is also not well-defined, it's even worse (is the first line a header or not, for example?).
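Concretely, a readLine()/split(",") approach like the snippet upthread sees that quoted record as two separate physical lines, and neither "first field" it extracts is a real field:

```java
public class SplitMisparse {
    public static void main(String[] args) {
        // readLine() splits the quoted record "b<CRLF>bb","ccc" at the
        // embedded CRLF, so the parser sees two bogus "rows":
        String line1 = "\"b";            // first physical line
        String line2 = "bb\",\"ccc\"";   // second physical line
        System.out.println(line1.split(",")[0]); // prints: "b
        System.out.println(line2.split(",")[0]); // prints: bb"
    }
}
```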
>and then used whatever tools I wanted to get it done, then email in the completed exercise.
This is a much more realistic way of testing someone's skills. Asking someone to write code while they watch is not.
I am personally horrible at coding while someone is looking over my shoulder. I am just too preoccupied with their presence and the fact that they are watching. And, unless a company is still into pair programming (is anyone these days?), it's not a valid test.
Give me a real problem, reasonable time to solve it, and the tools I'd have in the real world. Then, I can show you what I can do in that same real world vs. how well I interview in some contrived format.