I. Introduction
In 2010, Google published the paper "Dremel: Interactive Analysis of Web-Scale Datasets", and in 2013 Apache Parquet was open-sourced as an implementation based on Dremel's ideas. Parquet has since become a general-purpose big-data file storage format, and today it is widely used in lakehouse architectures. This document records two convenient ways to read the data in a Parquet file without writing a program.
II. Reading Parquet with DuckDB in DBeaver
1. Create a new DuckDB connection
1.1 Select DuckDB
1.2 Fill in the Path
Set the Path to ":memory:", which tells DuckDB to run as a temporary in-memory database instead of persisting data to a file.
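Behind the scenes, DBeaver assembles a DuckDB JDBC URL from this path; for the in-memory case it should end up looking roughly like the following (the exact form may vary with the driver version):
jdbc:duckdb::memory: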
2. Run queries
Parquet data file: ~/Documents/parquet/parquet-data/c0496a45-85c7-4484-aa10-8f0c460dff0b_0-2-0_20240110235300493.parquet
2.1 Query the data
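In the SQL editor of this connection, the file can be queried directly by its path. A minimal example against the sample file above (if your DuckDB version does not expand ~, replace it with the absolute path):
-- read the sample Parquet file directly by path
SELECT * FROM '~/Documents/parquet/parquet-data/c0496a45-85c7-4484-aa10-8f0c460dff0b_0-2-0_20240110235300493.parquet';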
2.2 Query the schema
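To see which columns and types the file contains, DESCRIBE works on the same path, and DuckDB's parquet_schema function exposes the Parquet-level schema (again just a sketch against the sample file):
-- list columns and types
DESCRIBE SELECT * FROM '~/Documents/parquet/parquet-data/c0496a45-85c7-4484-aa10-8f0c460dff0b_0-2-0_20240110235300493.parquet';
-- or look at the Parquet-level schema
SELECT * FROM parquet_schema('~/Documents/parquet/parquet-data/c0496a45-85c7-4484-aa10-8f0c460dff0b_0-2-0_20240110235300493.parquet');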
2.3 Query the metadata
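Row-group and column-chunk metadata (row counts, compression, statistics, and so on) is available through the parquet_metadata function, for example:
-- row-group / column-chunk metadata of the sample file
SELECT * FROM parquet_metadata('~/Documents/parquet/parquet-data/c0496a45-85c7-4484-aa10-8f0c460dff0b_0-2-0_20240110235300493.parquet');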
2.4 More
For more ways to read and write Parquet, see the official DuckDB documentation. The examples below are quoted directly from it:
-- read a single Parquet file
SELECT * FROM 'test.parquet';
-- figure out which columns/types are in a Parquet file
DESCRIBE SELECT * FROM 'test.parquet';
-- create a table from a Parquet file
CREATE TABLE test AS SELECT * FROM 'test.parquet';
-- if the file does not end in ".parquet", use the read_parquet function
SELECT * FROM read_parquet('test.parq');
-- use list parameter to read 3 Parquet files and treat them as a single table
SELECT * FROM read_parquet(['file1.parquet', 'file2.parquet', 'file3.parquet']);
-- read all files that match the glob pattern
SELECT * FROM 'test/*.parquet';
-- read all files that match the glob pattern, and include a "filename" column that specifies which file each row came from
SELECT * FROM read_parquet('test/*.parquet', filename = true);
-- use a list of globs to read all Parquet files from 2 specific folders
SELECT * FROM read_parquet(['folder1/*.parquet', 'folder2/*.parquet']);
-- read over https
SELECT * FROM read_parquet('https://some.url/some_file.parquet');
-- query the metadata of a Parquet file
SELECT * FROM parquet_metadata('test.parquet');
-- query the schema of a Parquet file
SELECT * FROM parquet_schema('test.parquet');
-- write the results of a query to a Parquet file
COPY (SELECT * FROM tbl) TO 'result-snappy.parquet' (FORMAT 'parquet');
-- write the results from a query to a Parquet file with specific compression and row_group_size
COPY (FROM generate_series(100000)) TO 'test.parquet' (FORMAT 'parquet', COMPRESSION 'ZSTD', ROW_GROUP_SIZE 100000);
-- export the table contents of the entire database as parquet
EXPORT DATABASE 'target_directory' (FORMAT PARQUET);
III. Querying data with Parquet CLI
1. Installation
My machine runs macOS, so I installed it directly with Homebrew:
brew install parquet-cli
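After installation the tool is available as the parquet command; running it without arguments, or with the help subcommand, should list the available subcommands (cat, meta, schema, pages, footer, and so on), assuming the Homebrew formula installs the binary under this name:
$ parquet help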
2. Usage
2.1 Query the data
$ parquet cat ./c0496a45-85c7-4484-aa10-8f0c460dff0b_0-2-0_20240110235300493.parquet
{"_hoodie_commit_time": "20240110235300493", "_hoodie_commit_seqno": "20240110235300493_0_0", "_hoodie_record_key": "test1", "_hoodie_partition_path": "", "_hoodie_file_name": "c0496a45-85c7-4484-aa10-8f0c460dff0b_0-2-0_20240110235300493.parquet", "id": "test1", "name": "Bob", "age": 100}
2.2 Query the metadata
$ parquet meta ./c0496a45-85c7-4484-aa10-8f0c460dff0b_0-2-0_20240110235300493.parquet
File path: ./c0496a45-85c7-4484-aa10-8f0c460dff0b_0-2-0_20240110235300493.parquet
Created by: parquet-mr version 1.12.2 (build 77e30c8093386ec52c3cfa6c34b7ef3321322c94)
Properties:
hoodie_bloom_filter_type_code: DYNAMIC_V0
org.apache.hudi.bloomfilter: /////
...
...
...
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=
hoodie_min_record_key: test1
parquet.avro.schema: {"type":"record","name":"hudi_table_record","namespace":"hoodie.hudi_table","fields":[{"name":"_hoodie_commit_time","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_commit_seqno","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_record_key","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_partition_path","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_file_name","type":["null","string"],"doc":"","default":null},{"name":"id","type":"string"},{"name":"name","type":["null","string"],"default":null},{"name":"age","type":["null","int"],"default":null}]}
writer.model.name: avro
hoodie_max_record_key: test1
Schema:
message hoodie.hudi_table.hudi_table_record {
optional binary _hoodie_commit_time (STRING);
optional binary _hoodie_commit_seqno (STRING);
optional binary _hoodie_record_key (STRING);
optional binary _hoodie_partition_path (STRING);
optional binary _hoodie_file_name (STRING);
required binary id (STRING);
optional binary name (STRING);
optional int32 age;
}
Row group 0: count: 1 517.00 B records start: 4 total(compressed): 517 B total(uncompressed):378 B
--------------------------------------------------------------------------------
type encodings count avg size nulls min / max
_hoodie_commit_time BINARY G _ 1 68.00 B 0 "20240110235300493" / "20240110235300493"
_hoodie_commit_seqno BINARY G _ 1 72.00 B 0 "20240110235300493_0_0" / "20240110235300493_0_0"
_hoodie_record_key BINARY G _ 1 56.00 B 0 "test1" / "test1"
_hoodie_partition_path BINARY G _ 1 50.00 B 0 "" / ""
_hoodie_file_name BINARY G _ 1 114.00 B 0 "c0496a45-85c7-4484-aa10-8..." / "c0496a45-85c7-4484-aa10-8..."
id BINARY G _ 1 52.00 B 0 "test1" / "test1"
name BINARY G _ 1 54.00 B 0 "Bob" / "Bob"
age INT32 G _ 1 51.00 B 0 "100" / "100"
2.3 Query the schema
$ parquet schema ./c0496a45-85c7-4484-aa10-8f0c460dff0b_0-2-0_20240110235300493.parquet
{
"type" : "record",
"name" : "hudi_table_record",
"namespace" : "hoodie.hudi_table",
"fields" : [ {
"name" : "_hoodie_commit_time",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "_hoodie_commit_seqno",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "_hoodie_record_key",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "_hoodie_partition_path",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "_hoodie_file_name",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "id",
"type" : "string"
}, {
"name" : "name",
"type" : [ "null", "string" ],
"default" : null
}, {
"name" : "age",
"type" : [ "null", "int" ],
"default" : null
} ]
}
2.4 Query page info
$ parquet pages ./c0496a45-85c7-4484-aa10-8f0c460dff0b_0-2-0_20240110235300493.parquet
Column: _hoodie_commit_time
--------------------------------------------------------------------------------
page type enc count avg size size rows nulls min / max
0-0 data G _ 1 27.00 B 27 B
Column: _hoodie_commit_seqno
--------------------------------------------------------------------------------
page type enc count avg size size rows nulls min / max
0-0 data G _ 1 31.00 B 31 B
Column: _hoodie_record_key
--------------------------------------------------------------------------------
page type enc count avg size size rows nulls min / max
0-0 data G _ 1 15.00 B 15 B
Column: _hoodie_partition_path
--------------------------------------------------------------------------------
page type enc count avg size size rows nulls min / max
0-0 data G _ 1 10.00 B 10 B
Column: _hoodie_file_name
--------------------------------------------------------------------------------
page type enc count avg size size rows nulls min / max
0-0 data G _ 1 78.00 B 78 B
Column: id
--------------------------------------------------------------------------------
page type enc count avg size size rows nulls min / max
0-0 data G _ 1 9.00 B 9 B
Column: name
--------------------------------------------------------------------------------
page type enc count avg size size rows nulls min / max
0-0 data G _ 1 13.00 B 13 B
Column: age
--------------------------------------------------------------------------------
page type enc count avg size size rows nulls min / max
0-0 data G _ 1 10.00 B 10 B
2.5 Query the footer
$ parquet footer ./c0496a45-85c7-4484-aa10-8f0c460dff0b_0-2-0_20240110235300493.parquet
{
"fileMetaData" : {
"schema" : {
"name" : "hoodie.hudi_table.hudi_table_record",
"repetition" : "REPEATED",
"logicalTypeAnnotation" : null,
"id" : null,
"fields" : [ {
"name" : "_hoodie_commit_time",
"repetition" : "OPTIONAL",
"logicalTypeAnnotation" : { },
"id" : null,
"primitive" : "BINARY",
"length" : 0,
"decimalMeta" : null,
"columnOrder" : {
"columnOrderName" : "TYPE_DEFINED_ORDER"
}
}, {
"name" : "_hoodie_commit_seqno",
"repetition" : "OPTIONAL",
"logicalTypeAnnotation" : { },
"id" : null,
"primitive" : "BINARY",
"length" : 0,
"decimalMeta" : null,
"columnOrder" : {
"columnOrderName" : "TYPE_DEFINED_ORDER"
}
}, {
"name" : "_hoodie_record_key",
"repetition" : "OPTIONAL",
"logicalTypeAnnotation" : { },
"id" : null,
"primitive" : "BINARY",
"length" : 0,
"decimalMeta" : null,
"columnOrder" : {
"columnOrderName" : "TYPE_DEFINED_ORDER"
}
}, {
"name" : "_hoodie_partition_path",
"repetition" : "OPTIONAL",
"logicalTypeAnnotation" : { },
"id" : null,
"primitive" : "BINARY",
"length" : 0,
...
...
...
"columnIndexReference" : {
"offset" : 856,
"length" : 23
},
"offsetIndexReference" : {
"offset" : 958,
"length" : 11
},
"bloomFilterOffset" : -1,
"firstDataPage" : -2147483178,
"dictionaryPageOffset" : -2147483648,
"valueCount" : -2147483647,
"totalSize" : -2147483597,
"totalUncompressedSize" : -2147483615,
"statistics" : {
"type" : {
"name" : "age",
"repetition" : "OPTIONAL",
"logicalTypeAnnotation" : null,
"id" : null,
"primitive" : "INT32",
"length" : 0,
"decimalMeta" : null,
"columnOrder" : {
"columnOrderName" : "TYPE_DEFINED_ORDER"
}
},
"comparator" : { },
"hasNonNullValue" : true,
"num_nulls" : 0,
"stringifier" : {
"name" : "DEFAULT_STRINGIFIER",
"digits" : "0123456789ABCDEF"
},
"max" : 100,
"min" : 100
}
} ],
"rowCount" : 1,
"totalByteSize" : 378,
"path" : null,
"ordinal" : 0,
"rowIndexOffset" : 0
} ]
}