I spent a few hours trying to find a definitive answer to this question, and hopefully my post will save someone time and trouble. The short answer is yes: if you compress Parquet files with Snappy, they are indeed splittable.

First off, why should you even care about compression? A typical Hadoop job is IO bound, not CPU bound, so a light and fast compression codec will actually improve performance. There are tons of posts on the web if you want more detail about the various codecs, and you will find that both Cloudera and Hortonworks recommend Snappy. Snappy is designed for speed: it is an extremely fast compressor (250 MB/s) and decompressor (500 MB/s), and it does not load your CPU cores hard. The downside, of course, is that it does not compress as well as gzip or bzip2. (Bzip2's decompression speed is faster than its compression speed, but it is still slower than the other formats.)

If you've read about the Parquet format, you know that Parquet already applies some cool, smart compression and encoding to your data: delta encoding, run-length encoding, dictionary encoding, and so on. It is still a very good idea to use Snappy compression on top of that, though. In my tests (and your mileage will vary), Snappy reduced my Parquet files by at least 2x while improving job processing time by 10-20%.

Once you figure out that Snappy is the way to go and learn how to tweak the settings for intermediate and output compression, you will stumble upon the notion of a codec being "splittable" or not. (By the way, I do not believe "splittable" is an actual English word.) If you pay attention, you quickly notice that Snappy is NOT splittable, and the next thing you read is that this is a really bad thing: if an HDFS file has more than one block, map/reduce jobs would have to decompress the entire file first (all the blocks), and only one core can do it at a time, hurting parallelism a lot. This is when I started looking frantically for an answer and ended up spending hours on it.

Earlier versions of the Cloudera documentation were plainly wrong, stating that Snappy is splittable, and we know it is not. The Hortonworks docs were even more vague on the subject. The recent version of the CDH documentation fortunately delivers a better message (link):

> For MapReduce, if you need your compressed data to be splittable, BZip2 and LZO formats can be split. Snappy and GZip blocks are not splittable, but files with Snappy blocks inside a container file format such as SequenceFile or Avro can be split. Snappy is intended to be used with a container format, like SequenceFiles or Avro data files, rather than being used directly on plain text, for example, since the latter is not splittable and cannot be processed in parallel using MapReduce. Splittability is not relevant to HBase data.

My only question was why they did not mention their favorite Parquet format. I had to dig further to see whether the Parquet/Snappy combo is indeed splittable.

Parquet stores rows and columns in so-called row groups, and you can think of them as the above-mentioned containers. A property defines the Parquet file block size (the row group size), which would normally be the same as the HDFS block size. Snappy would compress the individual Parquet row groups, keeping the Parquet file splittable.

The excellent Tom White book, Hadoop: The Definitive Guide, 4th Edition, also confirms this:

> The consequence of storing the metadata in the footer is that reading a Parquet file requires an initial seek to the end of the file (minus 8 bytes) to read the footer metadata length, then a second seek backward by that length to read the footer metadata. Unlike sequence files and Avro datafiles, where the metadata is stored in the header and sync markers are used to separate blocks, Parquet files don't need sync markers since the block boundaries are stored in the footer metadata. (This is possible because the metadata is written after all the blocks have been written, so the writer can retain the block boundary positions in memory until the file is closed.) Therefore, Parquet files are splittable, since the blocks can be located after reading the footer and can then be processed in parallel (by MapReduce, for example).
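The two seeks that the quoted passage describes are easy to show in code. Below is a minimal sketch, not a real Parquet reader: it only locates the footer and does not decode the Thrift metadata, and `read_footer_length` is my own illustrative helper. It relies only on what the quote states about the file layout, namely that a Parquet file ends with a 4-byte little-endian footer length followed by the 4-byte magic number `PAR1`:

```python
import struct

def read_footer_length(path):
    """Locate the Parquet footer metadata without reading the whole file."""
    with open(path, "rb") as f:
        f.seek(-8, 2)                 # first seek: end of file minus 8 bytes
        footer_len, magic = struct.unpack("<I4s", f.read(8))
        if magic != b"PAR1":
            raise ValueError("not a Parquet file")
        f.seek(-(8 + footer_len), 2)  # second seek: back by the footer length
        footer = f.read(footer_len)   # Thrift-encoded metadata lives here
    return footer_len, footer
```

Once the footer is decoded, the reader knows every row group's offset, which is exactly what makes the blocks locatable and processable in parallel.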
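To make the container-format point concrete, here is a toy sketch of why block-wise compression inside a container stays splittable while one monolithic compressed stream does not. It uses Python's `zlib` as a stand-in for Snappy (the principle is identical), and the helper names are mine, not any Hadoop API:

```python
import zlib

def write_container(blocks):
    """Compress each block independently and record where each compressed
    block starts, the way SequenceFile/Avro/Parquet keep block boundaries
    in their own metadata."""
    data, index, pos = bytearray(), [], 0
    for block in blocks:
        comp = zlib.compress(block)
        data += comp
        index.append((pos, len(comp)))  # (offset, length) of this block
        pos += len(comp)
    return bytes(data), index

def read_block(data, index, i):
    """A split/task can decompress block i alone, without touching
    the rest of the file."""
    off, length = index[i]
    return zlib.decompress(data[off:off + length])
```

Because every `(offset, length)` entry is an independent entry point, a scheduler can hand different blocks to different tasks. A single compressed stream over the whole file has no such entry points, which is exactly why a bare Snappy-compressed file forces one task to decompress everything.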