Parquet would be a lot more interesting if it could be freed from all the Java/Hadoop/Spark baggage.
I don't want Hadoop. I don't want Spark. I don't want Drill. I don't want Presto.
I don't have big data. I do happen to have a few hundred gigabytes of compressed CSV files which would likely be a lot faster to slice and dice if they were stored in a compact columnar format.
I can see why you'd want to keep it non-Java and simple. I'll admit I'm biased since I work with Spark all day, but I think you'll eventually want to write a query where you wish you had Spark SQL or could manipulate the data as a Spark RDD.
Spark actually isn't too painful for this because you can run it locally in a single process. It is not exactly as elegant as the UNIX way you outline in your second example, but it isn't as horrifying as submitting a YARN job or spinning up a cluster.
In spark-shell:
// Skip this conversion if you're just running one query. Do it if you're running many.
spark.read.json("/path/to/json/logs/*.json").write.parquet("/target/path")  // input path(s) are placeholders
spark.read.parquet("/target/path").createOrReplaceTempView("logs")
spark.sql("select count(distinct src) from logs where dport = 23").show()
The incremental conversion of your JSON data set to Parquet will be a little bit more annoying to write in Scala than the above example, but is very much doable. There is also a small amount of overhead with the first spark.read.parquet, but it's faster on a local data source than it is against something like S3.
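For what it's worth, a minimal sketch of that incremental conversion might look something like this (the date-partitioned layout and paths are just assumptions for illustration):
import org.apache.spark.sql.SaveMode

// Hypothetical layout: each new day's raw JSON lands in its own directory.
val day = "2017-05-01"  // placeholder
spark.read.json(s"/raw/json/dt=$day")
  .write.mode(SaveMode.Append)
  .parquet("/target/path")
Each run appends the new day's data to the existing Parquet directory, so the temp view from the earlier snippet picks it up on the next spark.read.parquet.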
I just wish there were more CLI-friendly tools for the cases where the grep | cut | sort pipeline works, but you'd like to make it a bit more efficient.
Not directly related to your question, but I suspect pigz and GNU parallel would dramatically speed up your current command-line technique, if you are on a multi-processor machine.
Also, you might put export LC_ALL=C before the fgrep, since you are searching for simple ASCII characters.
I presume your cut is required because of other potential row conflicts, so one last thing you might test is the timings of putting cut after the fgrep instead of in front of it. This can be counter-intuitive but might make a difference for you.
If you are using sort to do your "count_distinct", try to make sure that sort's temporary files are on a different hard disk from the files you are reading.
Speaking of which, the disks you are reading off of will often be your bottleneck when processing large files through command line pipes.
> Not directly related to your question, but I suspect pigz and GNU parallel would dramatically speed up your current command-line technique, if you are on a multi-processor machine.
I was using pigz :-)
> I presume your cut is required because of other potential row conflicts, so one last thing you might test is the timings of putting cut after the fgrep instead of in front of it. This can be counter-intuitive but might make a difference for you.
The cut was actually bro-cut, which is a lot faster than cut. I couldn't use grep | cut because it would find the value I was looking for in other fields. awk '$5 == 23' or similar would have worked, but awk is slower than bro-cut too.
> If you are using sort to do your "count_distinct", try to make sure that sort's temporary files are on a different hard disk from the files you are reading.
I wasn't. The number of unique values easily fit in memory, so I used a short Python program that also counted the frequency of the values.
> Speaking of which, the disks you are reading off of will often be your bottleneck when processing large files through command line pipes.
It's a gigantic FreeNAS box over 10G Ethernet, so most of the time is spent uncompressing the data and ignoring all but two of the columns, which is why data stored in a columnar format would be a lot faster.
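For example, once the data is in Parquet, a Spark query like this sketch only has to read the two column chunks it touches instead of decompressing whole rows (same src/dport columns as the earlier example; the path is a placeholder):
spark.read.parquet("/target/path")
  .select("src", "dport")        // column pruning: only these columns are read from disk
  .where($"dport" === 23)
  .show()
(The $ column syntax assumes spark-shell, where spark.implicits._ is already imported.)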
>> pushing data frames in and out of memory as simple as possible
Where do you think it's pushing from and to? Parquet.
Wes McKinney (of pandas fame) is the author behind Arrow/Feather as well as parquet-cpp [1]. Arrow is a compact binary representation of columnar data for in-memory processing with the HUGE benefit that data can be copied over the network with no serialization/deserialization between processes or languages.
One of the primary advantages of Spark over MapReduce is that it doesn't require data to be copied to disk and incur costly SerDe overhead between stages. Now imagine that not only can I avoid writing to disk between processing stages, but I can also move GBs of data between nodes by simply copying bytes from the network stack into memory as-is.
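To make the "no serialization" part concrete, here's a minimal sketch using Arrow's Java API from Scala (assumes arrow-vector and its dependencies are on the classpath; the column name and values are made up):
import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.IntVector

val allocator = new RootAllocator(Long.MaxValue)   // off-heap buffer allocator
val dport = new IntVector("dport", allocator)      // one columnar vector of 32-bit ints
dport.allocateNew(3)
dport.set(0, 23); dport.set(1, 80); dport.set(2, 443)
dport.setValueCount(3)

// The values live in a fixed, language-agnostic binary buffer layout, so another
// process (or another language) can use those same bytes without re-parsing them.
println(dport.get(0))  // 23

dport.close(); allocator.close()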
>> Parquet is fantastic, but it has a lot of functionality that's not very relevant for sub-TB of data, local use.
Parquet files can still live on local storage; there's no dependency on HDFS or network storage. If I'm only dealing with MBs, then CSV will do the trick; otherwise Parquet wins in every category when reading and writing data.
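For example, a purely local round trip in spark-shell works fine with file:// paths (the paths here are just placeholders):
spark.read.option("header", "true").csv("file:///data/logs/*.csv.gz")
  .write.parquet("file:///data/logs-parquet")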
This doesn't fully answer what you're looking for, but there is a C/C++ API for working with Parquet that's not tied to JVM land: https://github.com/apache/parquet-cpp
Totally. I was kind of surprised there weren't many other implementations, in Go etc. I ended up rolling my own columnar format since I don't need parity, but it would be great to have more implementations; I don't want to touch Java personally.
My employer is doing a lot of this, mostly for cost reduction. Keeping live data in Elasticsearch (or some other horizontally scalable system) is expensive. It's far cheaper to keep a small portion live and the rest in S3.
Also worth mentioning that managed ES in AWS has a hard limit of 20 data nodes with 512GB of EBS. We've outgrown that and are now looking hard at alternatives like this.