Spark SQL JSON数据的解析

!文章内容如有错误或排版问题，请提交反馈，非常感谢！

公司记录日志的时候，将请求数据和返回数据以JSON格式存储到了数据库中，为了更高的处理这部分JSON数据，就用到了SparkSQL自带的一些JSON函数。这里做这些函数的方法做了一些整理。

get_json_object(json_txt, path)

get_json_object是SparkSQL提供的一个内置函数，用于从JSON字符串中提取嵌套的字段。该函数接受两个参数：一个JSON字符串和一个JSON路径。

第一个参数json_txt是一个包含JSON对象的字符串列。
第二个参数path是一个JSON路径，用于指定你想从JSON对象中提取哪个字段。这个路径应该以$开头，每个子字段用.分割。例如，如果JSON对象是{“a”:{“b”:1}}，你可以用$.a.b来提取值1。

以下是一个SQL查询的例子，它使用get_json_object函数提取JSON字符串中的字段：

SELECT get_json_object(json_column, '$.field1')
FROM table;

这个查询会返回json_column中field1字段的值。如果field1是一个嵌套的字段，你可以使用$.field1.subfield1这样的路径来提取它。

如果JSON路径不存在，或者JSON字符串无法解析，get_json_object函数会返回NULL。

如果你的JSON字符串包含一个数组(列表)，你仍然可以使用get_json_object函数。你只需要在JSON路径中加上数组索引。请注意，索引是从0开始的。

例如，假设json_column包含一个如下的JSON对象：

{"array":[1,2,3]}

你可以使用以下的查询来获取数组的第一个元素：

SELECT get_json_object(json_column, '$.array[0]')
FROM table;

这个查询会返回1。

如果你想获取数组的所有元素，你可以使用explode函数配合get_json_object使用。首先，你需要把JSON字符串转换为一个SparkSQL的array类型，然后使用explode函数创建一个新的行，每个行代表数组的一个元素。

如果你有一个包含多个JSON对象的数组，并且想从每个对象中提取id字段，你可以使用类似于以下的查询:

SELECT get_json_object(json_column, '$.array[*].id')
FROM table;

这种方法应该可以工作，但是它会返回一个包含所有id值的数组，而不是一个包含多行的表格。

json_array_length(jsonArray)

json_array_length的作用是返回JSON数组中元素的数量。

代码示例：

SELECT json_array_length('[1,2,3,4]');

或者可以结合get_json_object使用，例如：

SELECT json_array_length(get_json_object(json_text, '$.products[*].items'))
FROM table

json_object_keys(json_object)

以数组形式返回最外层JSON对象的所有键。示例：

json_object_keys({"f1":"abc","f2":{"f3":"a","f4":"b"}})

返回的是[f1,f2]

或者可以结合get_json_object使用，例如：

SELECT json_array_length(get_json_object(json_text, '$.products[0]'))
FROM table

json_tuple(jsonStr, p1, p2, …, pn)

json_tuple是SparkSQL中的一个函数，它用于从JSON字符串中提取特定的字段。它会返回一个包含n个字段值的元组。

你可以像下面这样使用json_tuple函数：

SELECT json_tuple(json_column, 'field1', 'field2', ..., 'fieldn')
FROM table;

在这个SQL语句中，json_column是你的JSON字符串所在的列，’field1′,’field2′,…,’fieldn’是你想从JSON字符串中提取的字段的名称。

例如，如果你有一个包含多个JSON对象的列，并且想从每个对象中提取id和name字段，你可以这样做：

SELECT json_tuple(json_column, 'id', 'name')
FROM table;

这将返回一个包含id和name字段值的元组。

schema_of_json(json[, options])

schema_of_json是SparkSQL中的一个函数，它用于从JSON字符串中推断出其schema。

示例：

SELECT schema_of_json('[{"col":0}]');
+---------------------------+
|schema_of_json([{"col":0}])|
+---------------------------+
|ARRAY<STRUCT<col:...|
+---------------------------+

SELECT schema_of_json('[{"col":01}]', map('allowNumericLeadingZeros', 'true'));
+----------------------------+
|schema_of_json([{"col":01}])|
+----------------------------+
|ARRAY<STRUCT<col:...|
+----------------------------+

但是个人城市后发现一直报错，原因好像是JSON字符串中有NULL值。due to data type mismatch: The input json should be a foldable string expression and not null;

from_json(jsonStr, schema[, options])

from_json是SparkSQL中的一个函数，它用于将JSON字符串解析为一个SparkSQL可以理解的结构化数据格式。

函数的第一个参数jsonStr是包含JSON数据的字符串。
第二个参数schema是一个描述所需数据结构的字符串，它定义了JSON数据的预期格式。

示例：

SELECT from_json('{"teacher":"Alice","student":[{"name":"Bob","rank":1},{"name":"Charlie","rank":2}]}','STRUCT<teacher:STRING,student:ARRAY<STRUCT<name:STRING,rank:INT>>>');

这里的 schema 可以使用 SparkSQL 中的 DDL（Data Definition Language）语法，该语法是一种更简洁的方式来定义数据结构。

以下是一些常用的数据类型和它们的 DDL 表示：

string：字符串类型。
int：整数类型。
double：双精度浮点数类型。
boolean：布尔类型。
date：日期类型。
timestamp：时间戳类型。

你还可以使用 DDL 语法来定义更复杂的数据类型，如数组、映射和结构等。例如：

array<type>：这定义了一个数组，数组中的元素都是同一类型。例如，array<int> 是一个整数数组。
map<type,type>：这定义了一个映射，其中的键和值可以是任何类型。例如，map<string,int> 是一个键为字符串、值为整数的映射。
struct<field1:type1,field2:type2,…>：这定义了一个结构，结构中可以包含多个字段，每个字段都有一个名称和类型。例如，struct<name:string,age:int> 是一个包含名为 “name” 的字符串字段和名为 “age” 的整数字段的结构。

关于你之前提到的 from_json 函数中的 schema，你可以使用 DDL 语法来定义 JSON 数据的预期结构。例如：

from_json(json_column,'struct<name:string,age:int,pets:array<string>,isAlive:boolean>')

在这个例子中，DDL schema 定义了一个名为 “name” 的字符串字段、一个名为 “age” 的整数字段、一个名为 “pets” 的字符串数组字段和一个名为 “isAlive” 的布尔字段。

to_json(expr[,options])

to_json 是 SparkSQL 的一个函数，它用于将一个结构化数据转换成 JSON 格式的字符串。

expr 是要转换成 JSON 的表达式，它通常是一个包含多个字段的结构或者一个 map。例如，你可以将一张表中的一行数据转换成 JSON：

SELECT to_json(struct(*)) as json
FROM table;

在这个 SQL 语句中，struct(*) 创建了一个包含表中所有字段的新结构，然后 to_json 函数将这个结构转换成了 JSON 格式的字符串。

选项参数 options 是一个 JSON 格式的字符串，包含了转换过程中可用的选项。一个常用的选项是 dateFormat，它用于指定日期字段在 JSON 中的表示方式。例如：

SELECT to_json(struct(*),'{"dateFormat":"dd/MM/yyyy"}') as json
FROM table;

在这个 SQL 语句中，我们设定了日期格式选项，这会影响到日期字段在 JSON 中的表示方式。

explode(from_json(…))

将 from_json 的内容展成多行（应用于 ARRAY，每个元素 1 行）

示例：

SELECT q.text
FROM (
SELECT explode(from_json('[{"text":"Tea"},{"text":"Apple"}]','ARRAY<STRUCT<text:STRING>>')) AS q
);

更复杂的代码示例：

SELECT inquiry_id, product_name, item_name, item_charge_type, item_upper
, item_lower, item_max_discount, res.name AS res_name, res.code AS res_code, res.carType AS res_car_type
, res.chargeType AS res_charge_type, res.distance AS res_distance, res.duration AS res_duration, res.seats AS res_seats, res.merchantPrice AS res_mer_price
, res.price AS res_price, res.finalPrice AS res_final_price, res.discount AS res_discount, res.heiJDiscount AS res_hj_discount, res.discounts AS res_discounts
, res.priceAdjust AS res_price_adjust
FROM (
SELECT inquiry_id, product_name, item.name AS item_name, item.chargeTypeName AS item_charge_type, item.upper AS item_upper
, item.lower AS item_lower, item.maxDiscount AS item_max_discount, item.resources AS item_resources
FROM (
SELECT inquiry_id, product.name AS product_name, product.items AS items
FROM (
SELECT inquiry_id, from_json(inquiry_result,'struct<products:array<struct<name:string,items:array<struct<name:string,chargeTypeName:string,upper:string,lower:string,maxDiscount:string,resources:array<struct<name:string,code:string,carType:int,chargeType:int,distance:int,duration:int,seats:int,merchantPrice:string,price:string,finalPrice:string,discount:string,heiJDiscount:string,discounts:array<struct<name:string,type:string,batchNo:string,flashSale:boolean,amount:string>>,priceAdjust:struct<search:struct<amount:string,percent:string>,booked:struct<amount:string,percent:string>>>>>>>>>') AS products
FROM table
) p
LATERAL VIEW explode(p.products.products) exploded_table AS product
) i
LATERAL VIEW explode(i.items) exploded_table AS item
) r
LATERAL VIEW explode(r.item_resources) exploded_table AS res

参考链接：

Built-in Functions – Spark 3.5.1 Documentation (apache.org)

Spark SQL JSON数据的解析

get_json_object(json_txt, path)

json_array_length(jsonArray)

json_object_keys(json_object)

json_tuple(jsonStr, p1, p2, …, pn)

schema_of_json(json[, options])

from_json(jsonStr, schema[, options])

to_json(expr[,options])

explode(from_json(…))

《软件随想录》读书笔记

创业时应该如何选择域名？

项目实践：正负样本文本的关键词提取

发表回复取消回复

Spark SQL JSON数据的解析

get_json_object(json_txt, path)

json_array_length(jsonArray)

json_object_keys(json_object)

json_tuple(jsonStr, p1, p2, …, pn)

schema_of_json(json[, options])

from_json(jsonStr, schema[, options])

to_json(expr[,options])

explode(from_json(…))

《软件随想录》读书笔记

创业时应该如何选择域名？

项目实践：正负样本文本的关键词提取

发表回复 取消回复

发表回复取消回复