Indexing JSON logs with Parquet (vistarmedia.com)
60 points by vistarchris on Dec 27, 2016 | 20 comments



parquet would be a lot more interesting if it could be freed from all the java/hadoop/spark baggage.

I don't want hadoop. I don't want spark. I don't want drill. I don't want presto.

I don't have big data. I do happen to have a few hundred gigabytes of compressed csv files which would likely be a lot faster to slice and dice if they were stored in a compact column format.

The other day I did something like

  $ zcat logs/*.gz | cut -f 3,5 | fgrep -w 23 | count_distinct
It took about 30 minutes on a single machine, mostly from the IO/decompression overhead.

All I want to do is something like

  $ zcat logs/*.json.gz | json2parquet parquet_logs
  $ parquet-filter --output-fields src dport=23 parquet_logs | count_distinct
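
The closest I can sketch today is a small Python script on top of pandas and pyarrow (untested, and those library choices are just an example, not an existing tool):

    # json2parquet.py -- sketch only: newline-delimited JSON on stdin -> one Parquet file.
    # Usage: zcat logs/*.json.gz | python json2parquet.py parquet_logs/2016-12-28.parquet
    import sys
    import pandas as pd

    df = pd.read_json(sys.stdin, lines=True)      # read NDJSON from the pipe
    df.to_parquet(sys.argv[1], engine="pyarrow")  # write a columnar Parquet file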


I can see why you'd want to keep it non-Java and simple. I will admit that I am biased since I work with Spark all day, but I think you'll eventually want to write a query where you wish you had Spark SQL or could manipulate it as a Spark RDD.

Spark actually isn't too painful for this because you can run it locally in a single process. It is not exactly as elegant as the UNIX way you outline in your second example, but it isn't as horrifying as submitting a YARN job or spinning up a cluster.

In spark-shell:

    // Skip this conversion if you're just running one query. Do it if you're running many.
    spark.read.json("/input/paths/*.json.gz").write.parquet("/target/path")
    spark.read.parquet("/target/path").createOrReplaceTempView("logs")
    spark.sql("select count(distinct src) from logs where dport = 23").show()
The incremental conversion of your JSON data set to Parquet will be a little bit more annoying to write in Scala than the above example, but is very much doable. There is also a small amount of overhead with the first spark.read.parquet, but it's faster on a local data source than it is against something like S3.
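
A sketch of that incremental conversion in PySpark, assuming one gzipped JSON file per day and a parallel directory of Parquet output (the paths and layout are made up for illustration):

    # incremental_convert.py -- sketch: convert only the days that have no Parquet output yet.
    import os
    import sys
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("json-to-parquet").getOrCreate()

    json_dir, parquet_dir = sys.argv[1], sys.argv[2]
    for name in sorted(os.listdir(json_dir)):
        day = name.replace(".json.gz", "")
        target = os.path.join(parquet_dir, day)
        if os.path.exists(target):
            continue  # this day was already converted
        spark.read.json(os.path.join(json_dir, name)).write.parquet(target)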


Yeah.. I agree. It's not really that I don't want to use spark, it's more that I don't want to have to use spark just to convert some data files.

If I can write a short tool that wraps the

  spark.read.json("/input/paths/*.json.gz").write.parquet("/target/path")
as

  spark.read.json(args(0)).write.parquet(args(1))
and then use it like

  json_to_parquet /logs/json/2016-12-28.gz /logs/parquet/2016-12-28
or something, that would work.

I just wish there were some more CLI-friendly tools for the cases where the grep | cut | sort pipeline works, but you'd like to make it a bit more efficient.
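
For what it's worth, a minimal PySpark version of that wrapper might look like this (an untested sketch, run with spark-submit in local mode):

    # json_to_parquet.py -- sketch of the wrapper described above.
    # Usage: spark-submit json_to_parquet.py /logs/json/2016-12-28.gz /logs/parquet/2016-12-28
    import sys
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").getOrCreate()
    spark.read.json(sys.argv[1]).write.parquet(sys.argv[2])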


Not directly related to your question, but I suspect pigz and gnu parallel would dramatically speed up your current command-line technique, if you are on a multi-processor machine.

Also, you might put export LC_ALL=C before fgrep, since you are searching for simple ASCII characters.

I presume your cut is required because the value could otherwise match in other fields, so one last thing you might test is the timing of putting cut after the fgrep instead of in front of it. This can be counterintuitive but might make a difference for you.

If you are using sort to do your "count_distinct", try to make sure that sort's temporary files are on a different hard disk from the files you are reading.

Speaking of which, the disks you are reading off of will often be your bottleneck when processing large files through command line pipes.

http://zlib.net/pigz/

https://www.gnu.org/software/parallel/parallel_tutorial.html


> Not directly related to your question, but I suspect pigz and gnu parallel would dramatically speed up your current command-line technique, if you are on a multi-processor machine.

I was using pigz :-)

> I presume your cut is required because the value could otherwise match in other fields, so one last thing you might test is the timing of putting cut after the fgrep instead of in front of it. This can be counterintuitive but might make a difference for you.

The cut was actually bro-cut, which is a lot faster than cut. I couldn't use grep -> cut because it would find the value I was looking for in other fields. awk '$5 == 23' or similar would have worked, but awk is slower than bro-cut too.

> If you are using sort to do your "count_distinct", try to make sure that sort's temporary files are on a different hard disk from the files you are reading.

I wasn't. The number of unique values easily fit in memory, so I used a short Python program that also counted the frequency of each value.
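
Something along these lines, for the curious (a sketch of the idea, not the exact script):

    # count_distinct.py -- sketch: count distinct values (and their frequencies) on stdin.
    import sys
    from collections import Counter

    counts = Counter(line.strip() for line in sys.stdin)
    print(len(counts), "distinct values")
    for value, n in counts.most_common():
        print(n, value)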

> Speaking of which, the disks you are reading off of will often be your bottleneck when processing large files through command line pipes.

It's a gigantic FreeNAS box over 10G Ethernet, so most of the time is spent decompressing the data and discarding all but 2 of the columns, which is why data stored in a columnar format would be a lot faster.


Check out Apache Arrow and Feather for a "local" equivalent of Parquet

https://blog.cloudera.com/blog/2016/03/feather-a-fast-on-dis...
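
A minimal Feather round trip in Python looks like this (assuming pandas with pyarrow installed; the column names are just an example):

    # feather_example.py -- sketch: write a data frame to Feather and read it back.
    import pandas as pd

    df = pd.DataFrame({"src": ["10.0.0.1", "10.0.0.2"], "dport": [23, 443]})
    df.to_feather("logs.feather")          # columnar on-disk format, cheap to load
    print(pd.read_feather("logs.feather"))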


Not sure what you mean by "local". Arrow is in-memory, Parquet is on-disk. They are complementary techs.


But Feather is, as it describes itself, all about making:

"pushing data frames in and out of memory as simple as possible"

Parquet is fantastic, but it has a lot of functionality that's not very relevant for sub-TB of data, local use.


>> pushing data frames in and out of memory as simple as possible

Where do you think it's pushing from and to? Parquet.

Wes McKinney (of pandas fame) is the author behind Arrow/Feather as well as parquet-cpp [1]. Arrow is a compact binary representation of columnar data for in-memory processing with the HUGE benefit that data can be copied over the network with no serialization/deserialization between processes or languages.

One of the primary advantages of Spark over MapReduce is that it doesn't require data to be copied to disk and incur costly SerDe overhead. Now imagine that not only can I avoid writing to disk between processing stages, but I can also move GBs of data between nodes by simply copying data from the network stack to memory as-is.
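
As a concrete (if simplified) illustration with pyarrow: an Arrow table written in the IPC file format can be memory-mapped by another process and used in place, with no deserialization pass (the file name and columns here are made up):

    # arrow_ipc.py -- sketch: share columnar data between processes without deserialization.
    import pyarrow as pa

    table = pa.table({"src": ["10.0.0.1", "10.0.0.2"], "dport": [23, 443]})
    with pa.OSFile("batch.arrow", "wb") as sink:
        with pa.ipc.new_file(sink, table.schema) as writer:
            writer.write_table(table)

    # Another process can memory-map the same file; the columns are used in place.
    with pa.memory_map("batch.arrow") as source:
        shared = pa.ipc.open_file(source).read_all()
    print(shared.column("dport"))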

>> Parquet is fantastic, but it has a lot of functionality that's not very relevant for sub-TB of data, local use.

Parquet files can still be on local storage; there's no dependency on HDFS or network storage. If I'm only dealing with MBs then CSV will do the trick, but otherwise Parquet wins in every category when reading/writing data.

[1] https://github.com/apache/parquet-cpp
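
And for the use case upthread, a column-pruned read from a local Parquet file is a couple of lines with pyarrow (a sketch; the file path and column names are assumed):

    # read_columns.py -- sketch: read only the two needed columns from a local Parquet file.
    import pyarrow.parquet as pq

    table = pq.read_table("logs.parquet", columns=["src", "dport"])
    df = table.to_pandas()
    print(df[df["dport"] == 23]["src"].nunique())  # distinct sources hitting port 23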


Depending on your needs, consider trying out lz4!

Versus gzip, we found that lz4 decompresses 5x as quickly while maintaining comparable compression ratios and compression times.
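
If you want to sanity-check that on your own data, a rough comparison in Python with the lz4 package (a sketch; "sample.log" is a placeholder file):

    # lz4_vs_gzip.py -- sketch: compare decompression speed of the same payload.
    import gzip
    import time
    import lz4.frame

    data = open("sample.log", "rb").read()
    gz, lz = gzip.compress(data), lz4.frame.compress(data)

    for name, blob, decompress in [("gzip", gz, gzip.decompress),
                                   ("lz4", lz, lz4.frame.decompress)]:
        start = time.time()
        decompress(blob)
        print(name, len(blob), "bytes,", round(time.time() - start, 3), "s to decompress")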


Yep. lz4 is so fast at decompression that if you're reading from anything slower than RAM, you won't even notice the decompression.


This doesn't fully answer what you're looking for, but there is a C/C++ API for working with Parquet that's not tied to JVM-land: https://github.com/apache/parquet-cpp


I've been keeping an eye on it; the last time I looked, only reading was implemented.


You may be interested in what Wes McKinney has to say on the topic: http://wesmckinney.com/blog/outlook-for-2017/ (you can read/write using pyarrow)


Hey, I started working on a project that attempts to solve this problem. It's at https://github.com/zitterbewegung/dataknife


For a few hundred gigs I've found SQLite is a very efficient storage format, plus you get b-trees and declarative data access for free.
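
A tiny sketch of that workflow with Python's built-in sqlite3 (table and column names are made up for illustration):

    # sqlite_logs.py -- sketch: load log rows into SQLite and query them declaratively.
    import sqlite3

    con = sqlite3.connect("logs.db")
    con.execute("CREATE TABLE IF NOT EXISTS logs (src TEXT, dport INTEGER)")
    con.executemany("INSERT INTO logs VALUES (?, ?)",
                    [("10.0.0.1", 23), ("10.0.0.2", 443)])
    con.execute("CREATE INDEX IF NOT EXISTS idx_dport ON logs(dport)")  # b-tree for free
    con.commit()

    for (n,) in con.execute("SELECT COUNT(DISTINCT src) FROM logs WHERE dport = 23"):
        print(n)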


I've never tried it myself, but there is this standalone JAR for dumping columns from a parquet file:

https://github.com/Parquet/parquet-mr/tree/master/parquet-to...


Totally. I was kind of surprised there weren't really many other implementations, in Go etc. I ended up rolling my own columnar format since I don't need full Parquet parity, but it would be great to have more implementations; I don't want to touch Java personally.


I'm curious about where the threshold lies between "ELK Stack is good enough" and "We need Parquet".


My employer is doing a lot of this, mostly for cost reduction. Keeping live data in ElasticSearch (or some other horizontally scalable system) is expensive. Far cheaper to keep a small portion live and the rest in S3.

Also worth mentioning that managed ES in AWS has a hard limit of 20 data nodes with 512GB of EBS. We've outgrown that and are now looking hard at alternatives like this.



