Big data is a trending topic these days and I'd like to get my hands dirty, both out of curiosity and to make myself more relevant in the job market. That being said, I'm not sure which data sets are both interesting to play with and easily accessible. My question is:
For those of you already working with big data, what kind of data do you work with?
Fire up a VM with a single-node install on it [1] and just grab any old CSVs. Load them into HDFS, query them with Hive, then query them with Impala (or Drill, Spark SQL, etc.). Rinse and repeat with syslog data of any size, then with JSON data. Write a MapReduce job to transform the files in some way, then move on to some Spark exercises [2].

Read up on Kafka, understand how it works, and think about ways to get exactly-once message delivery. Hook Kafka up to HDFS, or HBase, or a complex event processing pipeline. You'll probably need to know about serialization formats too, so study up on Avro, protobuf and Parquet (or ORC, as long as you understand columnar storage).
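To make the CSV-in-HDFS and Spark steps concrete, here's roughly what a first exercise looks like in Scala; the HDFS paths and the timestamp,host,status column layout are just made up for the sketch:

    // Rough sketch: count requests per status code from CSVs already sitting in HDFS.
    // Paths and the timestamp,host,status layout are assumptions, not anything standard.
    import org.apache.spark.{SparkConf, SparkContext}

    object StatusCounts {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("status-counts"))
        val counts = sc.textFile("hdfs:///data/weblogs/*.csv")
          .map(_.split(","))
          .filter(_.length >= 3)              // drop malformed rows
          .map(fields => (fields(2), 1L))     // (status, 1)
          .reduceByKey(_ + _)
        counts.saveAsTextFile("hdfs:///data/weblogs-status-counts")
        sc.stop()
      }
    }

Run it with spark-submit, then write the same GROUP BY as a Hive query; doing both back to back is a good way to see what each engine buys you.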
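On the Kafka point, the thing to internalize is that a plain consumer gives you at-least-once delivery out of the box; exactly-once only falls out if the processing is idempotent or if offsets get committed atomically with the results. A rough consumer sketch using the Java client (topic name and connection settings are made up):

    // At-least-once consumption: process the batch first, commit offsets after.
    // A crash between the write and commitSync() means the batch is replayed,
    // which is why the downstream write needs to be idempotent to get exactly-once.
    import java.util.Properties
    import scala.collection.JavaConverters._
    import org.apache.kafka.clients.consumer.KafkaConsumer

    object SyslogConsumer {
      def main(args: Array[String]): Unit = {
        val props = new Properties()
        props.put("bootstrap.servers", "localhost:9092")
        props.put("group.id", "syslog-etl")
        props.put("enable.auto.commit", "false")  // commit manually, only after a successful write
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

        val consumer = new KafkaConsumer[String, String](props)
        consumer.subscribe(java.util.Arrays.asList("syslog"))
        while (true) {
          val records = consumer.poll(1000)
          for (record <- records.asScala) {
            // write record.value() to HDFS/HBase here
            println(s"${record.offset}: ${record.value}")
          }
          consumer.commitSync()                   // mark the batch done only after processing it
        }
      }
    }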
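And a tiny Avro example so the serialization part isn't abstract; Avro writes whole records row by row, while Parquet/ORC lay the same data out column by column, which is what makes scanning one or two fields cheap. The schema and field names here are just invented:

    // Minimal Avro write with the generic API; schema and fields are made up.
    import java.io.File
    import org.apache.avro.SchemaBuilder
    import org.apache.avro.file.DataFileWriter
    import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}

    object AvroSketch {
      def main(args: Array[String]): Unit = {
        val schema = SchemaBuilder.record("LogLine").fields()
          .requiredString("host")
          .requiredInt("status")
          .endRecord()

        val record = new GenericData.Record(schema)
        record.put("host", "example.com")
        record.put("status", 200)

        val writer = new DataFileWriter[GenericRecord](new GenericDatumWriter[GenericRecord](schema))
        writer.create(schema, new File("loglines.avro"))   // the schema travels with the file
        writer.append(record)
        writer.close()
      }
    }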
If you can talk intelligently about the whole grab bag of stuff these teams use, that'll get you in the door. Understanding RDBMSes, data warehousing concepts, and ETL is a big plus for people doing infrastructure work. If you're focused on analytics you can get away with less of the above, but knowing some of it still helps, along with stats and BI tools (or D3 if you want to roll your own visualizations).
[1] http://www.cloudera.com/content/cloudera/en/downloads/quicks... [2] http://ampcamp.berkeley.edu/5/