Parquet would be a lot more interesting if it could be freed from all the Java/Hadoop/Spark baggage.
I don't want Hadoop. I don't want Spark. I don't want Drill. I don't want Presto.
I don't have big data. I do happen to have a few hundred gigabytes of compressed CSV files which would likely be a lot faster to slice and dice if they were stored in a compact columnar format.
I can see why you'd want to keep it non-Java and simple. I'll admit I'm biased since I work with Spark all day, but I think you'll eventually want to write a query where you wish you had Spark SQL or could manipulate the data as a Spark RDD.
Spark actually isn't too painful for this because you can run it locally in a single process. It is not exactly as elegant as the UNIX way you outline in your second example, but it isn't as horrifying as submitting a YARN job or spinning up a cluster.
In spark-shell:
// Skip this conversion if you're just running one query. Do it if you're running many.
spark.read.json("/path/to/json/logs/*.json").write.parquet("/target/path")  // input path(s) are placeholders
spark.read.parquet("/target/path").createOrReplaceTempView("logs")
spark.sql("select count(distinct src) from logs where dport = 23").show()
The incremental conversion of your JSON data set to Parquet will be a little bit more annoying to write in Scala than the above example, but is very much doable. There is also a small amount of overhead with the first spark.read.parquet, but it's faster on a local data source than it is against something like S3.
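For what it's worth, a minimal sketch of that incremental conversion might look something like this (the date-partitioned layout and paths are just assumptions for illustration):
import org.apache.spark.sql.SaveMode

// Hypothetical layout: each new day's raw JSON lands in its own directory.
val day = "2017-05-01"  // placeholder
spark.read.json(s"/raw/json/dt=$day")
  .write.mode(SaveMode.Append)
  .parquet("/target/path")
Each run appends the new day's data to the existing Parquet directory, so the temp view from the earlier snippet picks it up on the next spark.read.parquet.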
I just wish there were more CLI-friendly tools for the cases where the grep | cut | sort pipeline works, but you'd like to make it a bit more efficient.
Not directly related to your question, but I suspect pigz and GNU parallel would dramatically speed up your current command-line technique, if you are on a multi-processor machine.
Also, you might put export LC_ALL=C before the fgrep, since you are searching for simple ASCII characters.
I presume your cut is required because of other potential row conflicts, so one last thing you might test is the timings of putting cut after the fgrep instead of in front of it. This can be counter-intuitive but might make a difference for you.
If you are using sort to do your "count_distinct", try to make sure that sort's temporary files are on a different hard disk from the files you are reading.
Speaking of which, the disks you are reading off of will often be your bottleneck when processing large files through command line pipes.
> Not directly related to your question, but I suspect pigz and GNU parallel would dramatically speed up your current command-line technique, if you are on a multi-processor machine.
I was using pigz :-)
> I presume your cut is required because of other potential row conflicts, so one last thing you might test is the timings of putting cut after the fgrep instead of in front of it. This can be counter-intuitive but might make a difference for you.
The cut was actually bro-cut, which is a lot faster than cut. I couldn't use grep | cut because it would find the value I was looking for in other fields. awk '$5 == 23' or similar would have worked, but awk is slower than bro-cut too.
> If you are using sort to do your "count_distinct", try to make sure that sort's temporary files are on a different hard disk from the files you are reading.
I wasn't. The number of unique values easily fit in memory, so I used a short Python program that also counted the frequency of the values.
> Speaking of which, the disks you are reading off of will often be your bottleneck when processing large files through command line pipes.
It's a gigantic FreeNAS box over 10G Ethernet, so most of the time is spent uncompressing the data and ignoring all but two of the columns, which is why data stored in a columnar format would be a lot faster.
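For example, once the data is in Parquet, a Spark query like this sketch only has to read the two column chunks it touches instead of decompressing whole rows (same src/dport columns as the earlier example; the path is a placeholder):
spark.read.parquet("/target/path")
  .select("src", "dport")        // column pruning: only these columns are read from disk
  .where($"dport" === 23)
  .show()
(The $ column syntax assumes spark-shell, where spark.implicits._ is already imported.)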
>> pushing data frames in and out of memory as simple as possible
Where do you think it's pushing from and to? Parquet.
Wes McKinney (of pandas fame) is the author behind Arrow/Feather as well as parquet-cpp [1]. Arrow is a compact binary representation of columnar data for in-memory processing with the HUGE benefit that data can be copied over the network with no serialization/deserialization between processes or languages.
One of the primary advantages of Spark over MapReduce is that it doesn't require data to be copied to disk and incur costly SerDe overhead between stages. Now imagine that not only can I avoid writing to disk between processing stages, but I can also move GBs of data between nodes by simply copying bytes from the network stack into memory as-is.
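To make the "no serialization" part concrete, here's a minimal sketch using Arrow's Java API from Scala (assumes arrow-vector and its dependencies are on the classpath; the column name and values are made up):
import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.IntVector

val allocator = new RootAllocator(Long.MaxValue)   // off-heap buffer allocator
val dport = new IntVector("dport", allocator)      // one columnar vector of 32-bit ints
dport.allocateNew(3)
dport.set(0, 23); dport.set(1, 80); dport.set(2, 443)
dport.setValueCount(3)

// The values live in a fixed, language-agnostic binary buffer layout, so another
// process (or another language) can use those same bytes without re-parsing them.
println(dport.get(0))  // 23

dport.close(); allocator.close()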
>> Parquet is fantastic, but it has a lot of functionality that's not very relevant for sub-TB of data, local use.
Parquet files can still live on local storage; there's no dependency on HDFS or network storage. If I'm only dealing with MBs, then CSV will do the trick; otherwise Parquet wins in every category when reading and writing data.
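For example, a purely local round trip in spark-shell works fine with file:// paths (the paths here are just placeholders):
spark.read.option("header", "true").csv("file:///data/logs/*.csv.gz")
  .write.parquet("file:///data/logs-parquet")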
This doesn't fully answer what you're looking for, but there is a C/C++ API for working with Parquet that's not tied to JVM land: https://github.com/apache/parquet-cpp
Totally. I was kind of surprised there weren't many other implementations, in Go etc. I ended up rolling my own columnar format since I don't need parity, but it would be great to have more implementations; I don't want to touch Java personally.
My employer is doing a lot of this, mostly for cost reduction. Keeping live data in Elasticsearch (or some other horizontally scalable system) is expensive. It's far cheaper to keep a small portion live and the rest in S3.
Also worth mentioning that managed ES in AWS has a hard limit of 20 data nodes with 512GB of EBS. We've outgrown that and are now looking hard at alternatives like this.