Crate parquet
Apache Parquet is a columnar storage format that provides efficient data compression and encoding schemes to improve the performance of handling complex nested data structures. Parquet implements the record-shredding and assembly algorithm described in the Dremel paper.
This crate provides an API to access the file schema and metadata of a Parquet file, extract row groups or column chunks from a file, and read records and values.
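To make the record-shredding idea more concrete, here is a minimal sketch for the simplest case: a single flat, optional column. It assumes a maximum definition level of 1 and no repetition levels. The function names `shred` and `assemble` are hypothetical and are not part of this crate's API.

```rust
/// Shreds a flat optional column into definition levels and non-null values.
/// A definition level of 1 marks a present value and 0 marks a null.
fn shred(column: &[Option<i32>]) -> (Vec<u8>, Vec<i32>) {
    let mut def_levels = Vec::with_capacity(column.len());
    let mut values = Vec::new();
    for entry in column {
        match entry {
            Some(value) => {
                def_levels.push(1);
                values.push(*value);
            }
            // Nulls take no space in the value stream.
            None => def_levels.push(0),
        }
    }
    (def_levels, values)
}

/// Reassembles the original column from definition levels and values.
fn assemble(def_levels: &[u8], values: &[i32]) -> Vec<Option<i32>> {
    let mut remaining = values.iter();
    def_levels
        .iter()
        .map(|&level| {
            if level == 1 {
                // Consume the next stored value for a present entry.
                remaining.next().copied()
            } else {
                None
            }
        })
        .collect()
}
```

Shredding `[Some(7), None, Some(9)]` stores only the two present values alongside three definition levels, and `assemble` restores the original column from them. Nested and repeated fields additionally need repetition levels, which this sketch leaves out.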
Usage
This crate is on crates.io and can be used by
adding parquet to the list of dependencies in Cargo.toml.
[dependencies]
parquet = "0.2"
and this to the project's crate root:
extern crate parquet;
Example
Below is an example of reading a Parquet file: listing Parquet metadata (including column chunk metadata), using the record API, and accessing row group readers.
use std::fs::File;
use std::path::Path;
use parquet::file::reader::{FileReader, SerializedFileReader};

// Creating a file reader
let path = Path::new("data/alltypes_plain.parquet");
let file = File::open(&path).expect("File should exist");
let reader = SerializedFileReader::new(file).expect("Valid Parquet file");

// Listing Parquet metadata
let parquet_metadata = reader.metadata();
let file_metadata = parquet_metadata.file_metadata();
for i in 0..parquet_metadata.num_row_groups() {
    // Accessing row group metadata
    let row_group_metadata = parquet_metadata.row_group(i);
    // Accessing column chunk metadata
    for j in 0..row_group_metadata.num_columns() {
        let column_chunk_metadata = row_group_metadata.column(j);
    }
}

// Reading data using record API
let mut iter = reader.get_row_iter(None).expect("Should be okay");
while let Some(record) = iter.next() {
    // do something with the record...
    println!("{}", record);
}

// Accessing row group readers in a file
for i in 0..reader.num_row_groups() {
    let row_group_reader = reader.get_row_group(i).expect("Should be okay");
}
Metadata
The metadata module contains the Parquet metadata structs:
file metadata, which holds the file schema, version, and number of rows, and
row group metadata, which holds a set of column chunks, each with its column type, encodings,
number of values, and compressed/uncompressed size in bytes.
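The nesting described above (file, then row groups, then column chunks) can be pictured with a simplified model. This is a sketch only: the structs `FileMeta`, `RowGroupMeta`, and `ColumnChunkMeta` and their fields are hypothetical stand-ins, not the crate's metadata types.

```rust
/// Hypothetical stand-in for per-column-chunk metadata.
struct ColumnChunkMeta {
    num_values: i64,
    compressed_size: i64,
    uncompressed_size: i64,
}

/// Hypothetical stand-in for row group metadata.
struct RowGroupMeta {
    num_rows: i64,
    columns: Vec<ColumnChunkMeta>,
}

/// Hypothetical stand-in for file-level metadata.
struct FileMeta {
    version: i32,
    row_groups: Vec<RowGroupMeta>,
}

/// Sums the row counts of all row groups in the file.
fn total_rows(file: &FileMeta) -> i64 {
    file.row_groups.iter().map(|rg| rg.num_rows).sum()
}

/// Returns (compressed, uncompressed) byte totals over every column chunk.
fn size_totals(file: &FileMeta) -> (i64, i64) {
    let mut compressed = 0;
    let mut uncompressed = 0;
    for row_group in &file.row_groups {
        for column in &row_group.columns {
            compressed += column.compressed_size;
            uncompressed += column.uncompressed_size;
        }
    }
    (compressed, uncompressed)
}
```

Walking the model takes the same two nested loops as walking the real metadata: an outer loop over row groups and an inner loop over each row group's column chunks.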
Schema and type
The Parquet schema can be extracted from FileMetaData
and is represented by a Parquet type.
A Parquet type is described by Type, which also covers the top-level
message type (the schema itself). Refer to the schema module for detailed information
on the Type API and on printing and parsing message types.
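As an illustration of how a tree of types maps to a printed message type, here is a minimal sketch. The `SchemaNode` enum and the `render` functions are hypothetical simplifications that carry only a name, a repetition, and a physical type. The output approximates the Parquet message format and is not guaranteed to match what the schema module prints.

```rust
/// Hypothetical, simplified schema node: either a leaf column or a group of fields.
enum SchemaNode {
    Primitive {
        name: String,
        repetition: &'static str,
        physical_type: &'static str,
    },
    Group {
        name: String,
        repetition: &'static str,
        fields: Vec<SchemaNode>,
    },
}

/// Appends the textual form of `node` to `out`, indented by `indent` levels.
fn render(node: &SchemaNode, indent: usize, out: &mut String) {
    let pad = "  ".repeat(indent);
    match node {
        SchemaNode::Primitive { name, repetition, physical_type } => {
            out.push_str(&format!("{}{} {} {};\n", pad, repetition, physical_type, name));
        }
        SchemaNode::Group { name, repetition, fields } => {
            out.push_str(&format!("{}{} group {} {{\n", pad, repetition, name));
            // Children are rendered one level deeper than their group.
            for field in fields {
                render(field, indent + 1, out);
            }
            out.push_str(&format!("{}}}\n", pad));
        }
    }
}

/// Renders a top-level message type with the given fields.
fn render_message(name: &str, fields: &[SchemaNode]) -> String {
    let mut out = format!("message {} {{\n", name);
    for field in fields {
        render(field, 1, &mut out);
    }
    out.push_str("}\n");
    out
}
```

The top-level message is handled separately from nested groups because it has no repetition of its own; everything beneath it is rendered by the same recursive walk.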
File and row group API
The file module contains all definitions needed to explore the metadata and data of Parquet files.
The file reader, FileReader, is the starting point for
working with Parquet files. It provides a set of methods to get file metadata,
row group readers (RowGroupReader) that give access to
column readers, and a record iterator.
Read API
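The crate offers several ways to read data from a Parquet file, including the low-level column reader API (see the column module) and the record-based API (see the record module).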
Modules
| basic | Contains Rust mappings for Thrift definition. Refer to |
| column | Low level column reader API. |
| compression | Contains codec interface and supported codec implementations. |
| data_type | Data types that connect Parquet physical types with their Rust-specific representations. |
| decoding | Contains all supported decoders for Parquet. |
| encoding | Contains all supported encoders for Parquet. |
| errors | Common Parquet errors and macros. |
| file | Main entrypoint for working with Parquet API. Provides access to file and row group readers, record API, etc. |
| memory | Utility methods and structs for working with memory. |
| record | Contains record-based API for reading Parquet files. |
| schema | Parquet schema definitions and methods to print and parse schema. |