
How to Read Data from Parquet Files


1. Introduction

In 2010, Google published the paper "Dremel: Interactive Analysis of Web-Scale Datasets", and in 2013 Apache Parquet was open-sourced as an implementation based on Dremel. Parquet has since become a common storage format for big data files, and today it is widely used in lakehouse architectures. This post records two convenient ways to read Parquet file data without having to write a program.

2. Reading with DuckDB in DBeaver

1. Create a DuckDB connection

1.1. Select DuckDB

(Screenshot: choosing DuckDB as the database driver in DBeaver's new-connection dialog)

1.2. Fill in the Path

(Screenshot: the DuckDB connection settings with the Path field)

Set Path to ":memory:". This creates an in-memory DuckDB database, which is all you need for querying Parquet files directly from disk.

2. Run queries

Parquet data file: ~/Documents/parquet/parquet-data/c0496a45-85c7-4484-aa10-8f0c460dff0b_0-2-0_20240110235300493.parquet

2.1. Query the data

(Screenshot: SELECT results for the sample file in DBeaver)
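
The query behind the screenshot is just a SELECT against the file path. Here is a sketch using the sample file above (if your DuckDB build does not expand ~, use the absolute path instead):

-- query the Parquet file directly by path
SELECT * FROM '~/Documents/parquet/parquet-data/c0496a45-85c7-4484-aa10-8f0c460dff0b_0-2-0_20240110235300493.parquet';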

2.2. Query the schema

(Screenshot: schema query results in DBeaver)
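
In SQL form, either of these shows the columns and types (the second uses DuckDB's parquet_schema table function):

-- column names and types as DuckDB sees them
DESCRIBE SELECT * FROM '~/Documents/parquet/parquet-data/c0496a45-85c7-4484-aa10-8f0c460dff0b_0-2-0_20240110235300493.parquet';
-- the Parquet-level schema
SELECT * FROM parquet_schema('~/Documents/parquet/parquet-data/c0496a45-85c7-4484-aa10-8f0c460dff0b_0-2-0_20240110235300493.parquet');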

2.3. Query the metadata

(Screenshot: metadata query results in DBeaver)
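
The equivalent SQL uses the parquet_metadata table function, which reports row groups and column chunks (offsets, compressed sizes, statistics):

-- row-group / column-chunk metadata
SELECT * FROM parquet_metadata('~/Documents/parquet/parquet-data/c0496a45-85c7-4484-aa10-8f0c460dff0b_0-2-0_20240110235300493.parquet');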

2.4. More

For more on reading and writing Parquet, see the official DuckDB documentation. The examples below are quoted directly from it:

-- read a single Parquet file
SELECT * FROM 'test.parquet';
-- figure out which columns/types are in a Parquet file
DESCRIBE SELECT * FROM 'test.parquet';
-- create a table from a Parquet file
CREATE TABLE test AS SELECT * FROM 'test.parquet';
-- if the file does not end in ".parquet", use the read_parquet function
SELECT * FROM read_parquet('test.parq');
-- use list parameter to read 3 Parquet files and treat them as a single table
SELECT * FROM read_parquet(['file1.parquet', 'file2.parquet', 'file3.parquet']);
-- read all files that match the glob pattern
SELECT * FROM 'test/*.parquet';
-- read all files that match the glob pattern, and include a "filename" column that specifies which file each row came from
SELECT * FROM read_parquet('test/*.parquet', filename = true);
-- use a list of globs to read all Parquet files from 2 specific folders
SELECT * FROM read_parquet(['folder1/*.parquet', 'folder2/*.parquet']);
-- read over https
SELECT * FROM read_parquet('https://some.url/some_file.parquet');
-- query the metadata of a Parquet file
SELECT * FROM parquet_metadata('test.parquet');
-- query the schema of a Parquet file
SELECT * FROM parquet_schema('test.parquet');


-- write the results of a query to a Parquet file
COPY (SELECT * FROM tbl) TO 'result-snappy.parquet' (FORMAT 'parquet');
-- write the results from a query to a Parquet file with specific compression and row_group_size
COPY (FROM generate_series(100000)) TO 'test.parquet' (FORMAT 'parquet', COMPRESSION 'ZSTD', ROW_GROUP_SIZE 100000);

-- export the table contents of the entire database as parquet
EXPORT DATABASE 'target_directory' (FORMAT PARQUET);
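
One note on the https example above: remote reads go through DuckDB's httpfs extension, so if that query errors out, install and load the extension first:

-- install once per DuckDB installation, load once per session
INSTALL httpfs;
LOAD httpfs;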

3. Querying data with the Parquet CLI

1. Install

My machine runs macOS, so I installed it with Homebrew:

brew install parquet-cli
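
Once installed, the help subcommand lists everything the tool can do (cat, head, schema, meta, pages, footer, and more):

# show usage and the list of subcommands
parquet help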

2. Usage

2.1. Query the data
$ parquet cat ./c0496a45-85c7-4484-aa10-8f0c460dff0b_0-2-0_20240110235300493.parquet
{"_hoodie_commit_time": "20240110235300493", "_hoodie_commit_seqno": "20240110235300493_0_0", "_hoodie_record_key": "test1", "_hoodie_partition_path": "", "_hoodie_file_name": "c0496a45-85c7-4484-aa10-8f0c460dff0b_0-2-0_20240110235300493.parquet", "id": "test1", "name": "Bob", "age": 100}
2.2. Query the metadata
$ parquet meta ./c0496a45-85c7-4484-aa10-8f0c460dff0b_0-2-0_20240110235300493.parquet

File path:  ./c0496a45-85c7-4484-aa10-8f0c460dff0b_0-2-0_20240110235300493.parquet
Created by: parquet-mr version 1.12.2 (build 77e30c8093386ec52c3cfa6c34b7ef3321322c94)
Properties:
  hoodie_bloom_filter_type_code: DYNAMIC_V0
    org.apache.hudi.bloomfilter: /////
    ...
    ...
    ...
    AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=
          hoodie_min_record_key: test1
            parquet.avro.schema: {"type":"record","name":"hudi_table_record","namespace":"hoodie.hudi_table","fields":[{"name":"_hoodie_commit_time","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_commit_seqno","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_record_key","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_partition_path","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_file_name","type":["null","string"],"doc":"","default":null},{"name":"id","type":"string"},{"name":"name","type":["null","string"],"default":null},{"name":"age","type":["null","int"],"default":null}]}
              writer.model.name: avro
          hoodie_max_record_key: test1
Schema:
message hoodie.hudi_table.hudi_table_record {
  optional binary _hoodie_commit_time (STRING);
  optional binary _hoodie_commit_seqno (STRING);
  optional binary _hoodie_record_key (STRING);
  optional binary _hoodie_partition_path (STRING);
  optional binary _hoodie_file_name (STRING);
  required binary id (STRING);
  optional binary name (STRING);
  optional int32 age;
}


Row group 0:  count: 1  517.00 B records  start: 4  total(compressed): 517 B total(uncompressed):378 B
--------------------------------------------------------------------------------
                        type      encodings count     avg size   nulls   min / max
_hoodie_commit_time     BINARY    G   _     1         68.00 B    0       "20240110235300493" / "20240110235300493"
_hoodie_commit_seqno    BINARY    G   _     1         72.00 B    0       "20240110235300493_0_0" / "20240110235300493_0_0"
_hoodie_record_key      BINARY    G   _     1         56.00 B    0       "test1" / "test1"
_hoodie_partition_path  BINARY    G   _     1         50.00 B    0       "" / ""
_hoodie_file_name       BINARY    G   _     1         114.00 B   0       "c0496a45-85c7-4484-aa10-8..." / "c0496a45-85c7-4484-aa10-8..."
id                      BINARY    G   _     1         52.00 B    0       "test1" / "test1"
name                    BINARY    G   _     1         54.00 B    0       "Bob" / "Bob"
age                     INT32     G   _     1         51.00 B    0       "100" / "100"
2.3. Query the schema

Note that parquet-cli prints the schema as Avro JSON; for this file, written through parquet-avro by Hudi, it matches the parquet.avro.schema property shown in the metadata above.
$ parquet schema ./c0496a45-85c7-4484-aa10-8f0c460dff0b_0-2-0_20240110235300493.parquet
{
  "type" : "record",
  "name" : "hudi_table_record",
  "namespace" : "hoodie.hudi_table",
  "fields" : [ {
    "name" : "_hoodie_commit_time",
    "type" : [ "null", "string" ],
    "doc" : "",
    "default" : null
  }, {
    "name" : "_hoodie_commit_seqno",
    "type" : [ "null", "string" ],
    "doc" : "",
    "default" : null
  }, {
    "name" : "_hoodie_record_key",
    "type" : [ "null", "string" ],
    "doc" : "",
    "default" : null
  }, {
    "name" : "_hoodie_partition_path",
    "type" : [ "null", "string" ],
    "doc" : "",
    "default" : null
  }, {
    "name" : "_hoodie_file_name",
    "type" : [ "null", "string" ],
    "doc" : "",
    "default" : null
  }, {
    "name" : "id",
    "type" : "string"
  }, {
    "name" : "name",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "age",
    "type" : [ "null", "int" ],
    "default" : null
  } ]
}
2.4. Query page information
$ parquet pages ./c0496a45-85c7-4484-aa10-8f0c460dff0b_0-2-0_20240110235300493.parquet

Column: _hoodie_commit_time
--------------------------------------------------------------------------------
  page   type  enc  count   avg size   size       rows     nulls   min / max
  0-0    data  G _  1       27.00 B    27 B


Column: _hoodie_commit_seqno
--------------------------------------------------------------------------------
  page   type  enc  count   avg size   size       rows     nulls   min / max
  0-0    data  G _  1       31.00 B    31 B


Column: _hoodie_record_key
--------------------------------------------------------------------------------
  page   type  enc  count   avg size   size       rows     nulls   min / max
  0-0    data  G _  1       15.00 B    15 B


Column: _hoodie_partition_path
--------------------------------------------------------------------------------
  page   type  enc  count   avg size   size       rows     nulls   min / max
  0-0    data  G _  1       10.00 B    10 B


Column: _hoodie_file_name
--------------------------------------------------------------------------------
  page   type  enc  count   avg size   size       rows     nulls   min / max
  0-0    data  G _  1       78.00 B    78 B


Column: id
--------------------------------------------------------------------------------
  page   type  enc  count   avg size   size       rows     nulls   min / max
  0-0    data  G _  1       9.00 B     9 B


Column: name
--------------------------------------------------------------------------------
  page   type  enc  count   avg size   size       rows     nulls   min / max
  0-0    data  G _  1       13.00 B    13 B


Column: age
--------------------------------------------------------------------------------
  page   type  enc  count   avg size   size       rows     nulls   min / max
  0-0    data  G _  1       10.00 B    10 B
2.5. Query the footer

$ parquet footer ./c0496a45-85c7-4484-aa10-8f0c460dff0b_0-2-0_20240110235300493.parquet
{
  "fileMetaData" : {
    "schema" : {
      "name" : "hoodie.hudi_table.hudi_table_record",
      "repetition" : "REPEATED",
      "logicalTypeAnnotation" : null,
      "id" : null,
      "fields" : [ {
        "name" : "_hoodie_commit_time",
        "repetition" : "OPTIONAL",
        "logicalTypeAnnotation" : { },
        "id" : null,
        "primitive" : "BINARY",
        "length" : 0,
        "decimalMeta" : null,
        "columnOrder" : {
          "columnOrderName" : "TYPE_DEFINED_ORDER"
        }
      }, {
        "name" : "_hoodie_commit_seqno",
        "repetition" : "OPTIONAL",
        "logicalTypeAnnotation" : { },
        "id" : null,
        "primitive" : "BINARY",
        "length" : 0,
        "decimalMeta" : null,
        "columnOrder" : {
          "columnOrderName" : "TYPE_DEFINED_ORDER"
        }
      }, {
        "name" : "_hoodie_record_key",
        "repetition" : "OPTIONAL",
        "logicalTypeAnnotation" : { },
        "id" : null,
        "primitive" : "BINARY",
        "length" : 0,
        "decimalMeta" : null,
        "columnOrder" : {
          "columnOrderName" : "TYPE_DEFINED_ORDER"
        }
      }, {
        "name" : "_hoodie_partition_path",
        "repetition" : "OPTIONAL",
        "logicalTypeAnnotation" : { },
        "id" : null,
        "primitive" : "BINARY",
        "length" : 0,

        ...
        ...
        ...

        "columnIndexReference" : {
          "offset" : 856,
          "length" : 23
        },
        "offsetIndexReference" : {
          "offset" : 958,
          "length" : 11
        },
        "bloomFilterOffset" : -1,
        "firstDataPage" : -2147483178,
        "dictionaryPageOffset" : -2147483648,
        "valueCount" : -2147483647,
        "totalSize" : -2147483597,
        "totalUncompressedSize" : -2147483615,
        "statistics" : {
          "type" : {
            "name" : "age",
            "repetition" : "OPTIONAL",
            "logicalTypeAnnotation" : null,
            "id" : null,
            "primitive" : "INT32",
            "length" : 0,
            "decimalMeta" : null,
            "columnOrder" : {
              "columnOrderName" : "TYPE_DEFINED_ORDER"
            }
          },
          "comparator" : { },
          "hasNonNullValue" : true,
          "num_nulls" : 0,
          "stringifier" : {
            "name" : "DEFAULT_STRINGIFIER",
            "digits" : "0123456789ABCDEF"
          },
          "max" : 100,
          "min" : 100
        }
      } ],
      "rowCount" : 1,
      "totalByteSize" : 378,
      "path" : null,
      "ordinal" : 0,
      "rowIndexOffset" : 0
    } ]
  }