Processing Data File Formats on OSS with Data Lake Analytics


Data Lake Analytics (DLA) is a serverless interactive query and analysis service on the cloud. With standard SQL statements, users can query and analyze data stored on OSS and Table Store directly, without moving the data.

The product is now generally available on Alibaba Cloud; you are welcome to apply for a trial and experience a more convenient data analysis service.

To apply for access to the service, see https://help.aliyun.com/document_detail/70386.html.

In the previous tutorial we showed how to analyze the TPC-H dataset in CSV format. Besides plain-text files (for example CSV and TSV), DLA can also query and analyze files stored on OSS in other formats, including ORC, PARQUET, JSON, RCFILE, and AVRO, even geographic JSON data in the Esri specification, as well as files that can be matched with regular expressions.

This article describes in detail how to use Data Lake Analytics (DLA below) to analyze files on OSS according to their storage format. DLA ships with SerDe (short for Serialize/Deserialize, used to serialize and deserialize data) implementations for a wide range of file formats, so you do not need to write your own parser; in almost all cases you can pick one or more of DLA's SerDes to match the format of your files on OSS. If they still cannot handle your special file format, contact us and we will support it as soon as possible.

You can create a table over the data files stored on OSS and specify the file format with STORED AS.

```sql
CREATE EXTERNAL TABLE nation (
    N_NATIONKEY INT,
    N_NAME STRING,
    N_REGIONKEY INT,
    N_COMMENT STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 'oss://test-bucket-julian-1/tpch_100m/nation';
```

After the table is created, you can view the original CREATE TABLE statement with SHOW CREATE TABLE.

```sql
mysql> show create table nation;
+------------------------------------------------------+
| Result                                               |
+------------------------------------------------------+
| CREATE EXTERNAL TABLE `nation`(
    `n_nationkey` int,
    `n_name` string,
    `n_regionkey` int,
    `n_comment` string)
  ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '|'
  STORED AS `TEXTFILE`
  LOCATION
    'oss://test-bucket-julian-1/tpch_100m/nation'      |
+------------------------------------------------------+
1 row in set (1.81 sec)
```

The file formats currently supported by DLA are listed below. When creating a table over files in one of these formats, you can simply use STORED AS, and DLA will choose the appropriate SERDE/INPUTFORMAT/OUTPUTFORMAT.

- TEXTFILE: the data file is stored as plain text. This is the default file type; each line in the file corresponds to one record in the table.
- JSON: the data file is stored as JSON (except for Esri ArcGIS geographic JSON data files).

Along with STORED AS, you can also specify a SerDe (used to parse the data files and map them to DLA tables), special column delimiters, and so on, according to the characteristics of the files.

2.1 CSV files

CSV files are essentially plain-text files and can use STORED AS TEXTFILE.

Columns are separated by commas, which is expressed as ROW FORMAT DELIMITED FIELDS TERMINATED BY ','.

For example, the data file oss://bucket-for-testing/oss/text/cities/city.csv contains:

```
Beijing,China,010
ShangHai,China,021
Tianjin,China,022
```

The table can be created with:

```sql
CREATE EXTERNAL TABLE city (
    city STRING,
    country STRING,
    code INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'oss://bucket-for-testing/oss/text/cities';
```

Using OpenCSVSerde to handle quoted fields

Note the following points when using OpenCSVSerde:

```sql
CREATE EXTERNAL TABLE test_csv_opencsvserde (
    id STRING,
    name STRING,
    location STRING,
    create_date STRING,
    create_timestamp STRING,
    longitude STRING,
    latitude STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
    "separatorChar" = ",",
    "quoteChar" = "\"",
    "escapeChar" = "\\"
)
STORED AS TEXTFILE LOCATION 'oss://test-bucket-julian-1/test_csv_serde_1';
```
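The effect of separatorChar, quoteChar, and escapeChar can be previewed locally with Python's csv module; a minimal sketch (the sample row below is hypothetical, not taken from the dataset above):

```python
import csv
import io

# Mirror OpenCSVSerde's separatorChar / quoteChar / escapeChar settings to
# preview how a quoted row splits; the comma inside quotes stays in one field.
row = '1,"Hangzhou, Zhejiang",2018-01-01\n'
reader = csv.reader(io.StringIO(row), delimiter=",", quotechar='"', escapechar="\\")
fields = next(reader)
print(fields)  # ['1', 'Hangzhou, Zhejiang', '2018-01-01']
```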

Custom delimiters

You can customize the column delimiter (FIELDS TERMINATED BY), escape character (ESCAPED BY), and line terminator (LINES TERMINATED BY).

```sql
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    ESCAPED BY '\\'
    LINES TERMINATED BY '\n'
```

Ignoring the header in CSV files

CSV files sometimes carry header information that should be ignored when the data is read. In that case, define skip.header.line.count in the CREATE TABLE statement.

For example, the data file oss://my-bucket/datasets/tpch/nation_csv/nation_header.tbl contains:

```
N_NATIONKEY|N_NAME|N_REGIONKEY|N_COMMENT
0|ALGERIA|0| haggle. carefully final deposits detect slyly agai|
1|ARGENTINA|1|al foxes promise slyly according to the regular accounts. bold requests alon|
2|BRAZIL|1|y alongside of the pending deposits. carefully special packages are about the ironic forges. slyly special|
3|CANADA|1|eas hang ironic, silent packages. slyly regular packages are furiously over the tithes. fluffily bold|
4|EGYPT|4|y above the carefully unusual theodolites. final dugouts are quickly across the furiously regular d|
5|ETHIOPIA|0|ven packages wake quickly. regu|
```

The corresponding CREATE TABLE statement is:

```sql
CREATE EXTERNAL TABLE nation_header (
    N_NATIONKEY INT,
    N_NAME STRING,
    N_REGIONKEY INT,
    N_COMMENT STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 'oss://my-bucket/datasets/tpch/nation_csv/nation_header.tbl'
TBLPROPERTIES ("skip.header.line.count" = "1");
```

The value x of skip.header.line.count relates to the actual number of lines n in the data file as follows: when x <= 0, no lines are skipped; when 0 < x < n, the first x lines are skipped; when x >= n, all lines of the file are skipped.
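The semantics can be sketched in Python (a hedged illustration of the behavior described above, not DLA's implementation):

```python
# Sketch of skip.header.line.count semantics: the first x lines are skipped;
# when x is at least the number of lines n, the whole file is skipped.
def rows_after_skip(lines, x):
    return lines[max(x, 0):]

data = ["N_NATIONKEY|N_NAME|N_REGIONKEY|N_COMMENT",
        "0|ALGERIA|0|...",
        "1|ARGENTINA|1|..."]
print(rows_after_skip(data, 1))  # header removed, data rows remain
print(rows_after_skip(data, 5))  # x >= n, everything skipped
```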

2.2 TSV files

Like CSV files, TSV files are plain-text files in which columns are separated by tabs.

For example, the data file oss://bucket-for-testing/oss/text/cities/city.tsv contains:

```
Beijing	China	010
ShangHai	China	021
Tianjin	China	022
```

The table can be created with:

```sql
CREATE EXTERNAL TABLE city (
    city STRING,
    country STRING,
    code INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 'oss://bucket-for-testing/oss/text/cities';
```

2.3 Files with multi-character field delimiters

If your field delimiter consists of multiple characters, you can use the CREATE TABLE pattern below, where the field delimiter of each line is "||"; replace it with your actual delimiter string.

```sql
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
WITH SERDEPROPERTIES (
    "field.delim" = "||"
)
```

Example:

```sql
CREATE EXTERNAL TABLE test_csv_multidelimit (
    id STRING,
    name STRING,
    location STRING,
    create_date STRING,
    create_timestamp STRING,
    longitude STRING,
    latitude STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
WITH SERDEPROPERTIES (
    "field.delim" = "||"
)
STORED AS TEXTFILE LOCATION 'oss://bucket-for-testing/oss/text/cities/';
```
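How a multi-character delimiter such as "||" splits a line can be previewed with a plain string split (the sample line below is hypothetical):

```python
# A "||"-delimited line splits into the seven columns of the table above.
line = "1||jack||Hangzhou||2018-01-01||12:00:00||120.19||30.26"
fields = line.split("||")
print(fields)
print(len(fields))  # 7
```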

2.4 JSON files

JSON files that DLA can process are usually stored as plain text; when creating the table, in addition to specifying STORED AS TEXTFILE, you must also define the SERDE.

In a JSON file, each line must be one complete JSON object.

For example, the following file format is not accepted:

```json
{"id": 123, "name": "jack",
"c3": "2001-02-03 12:34:56"}
{"id": 456, "name": "rose", "c3": "1906-04-18 05:12:00"}
{"id": 789, "name": "tom", "c3": "2001-02-03 12:34:56"}
{"id": 234, "name": "alice", "c3": "1906-04-18 05:12:00"}
```

It needs to be rewritten as:

```json
{"id": 123, "name": "jack", "c3": "2001-02-03 12:34:56"}
{"id": 456, "name": "rose", "c3": "1906-04-18 05:12:00"}
{"id": 789, "name": "tom", "c3": "2001-02-03 12:34:56"}
{"id": 234, "name": "alice", "c3": "1906-04-18 05:12:00"}
```
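A file of pretty-printed objects can be rewritten into the required one-object-per-line layout with Python's json module; a minimal sketch:

```python
import json

# Re-serialize each record onto a single line so that every line of the
# output file is one complete JSON object, as DLA requires.
records = [
    {"id": 123, "name": "jack", "c3": "2001-02-03 12:34:56"},
    {"id": 456, "name": "rose", "c3": "1906-04-18 05:12:00"},
]
json_lines = "\n".join(json.dumps(r) for r in records)
print(json_lines)
```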

JSON data without nesting

```sql
CREATE EXTERNAL TABLE t1 (id int, name string, c3 timestamp)
STORED AS JSON
LOCATION 'oss://path/to/t1/directory';
```

JSON files with nesting

Use struct and array types to define nested JSON data.

For example, here is a user's raw data (note: whether nested or not, a complete JSON record must be on a single line to be processed by Data Lake Analytics):

```json
{"DocId": "Alibaba", "User_1": {"Id": 1234, "Username": "bob1234", "Name": "Bob", "ShippingAddress": {"Address1": "969 Wenyi West St.", "Address2": null, "City": "Hangzhou", "Province": "Zhejiang"}, "Orders": [{"ItemId": 6789, "OrderDate": "11/11/2017"}, {"ItemId": 4352, "OrderDate": "12/12/2017"}]}}
```

Formatted with an online JSON formatter, the data looks like this:

```json
{
  "DocId": "Alibaba",
  "User_1": {
    "Id": 1234,
    "Username": "bob1234",
    "Name": "Bob",
    "ShippingAddress": {
      "Address1": "969 Wenyi West St.",
      "Address2": null,
      "City": "Hangzhou",
      "Province": "Zhejiang"
    },
    "Orders": [
      {
        "ItemId": 6789,
        "OrderDate": "11/11/2017"
      },
      {
        "ItemId": 4352,
        "OrderDate": "12/12/2017"
      }
    ]
  }
}
```

The CREATE TABLE statement can then be written as follows (note: the path in LOCATION must be the directory containing the JSON data files, and all JSON files under that directory will be recognized as data of the table):

```sql
CREATE EXTERNAL TABLE json_table_1 (
    docid string,
    user_1 struct<
        id:INT,
        username:string,
        name:string,
        shippingaddress:struct<
            address1:string,
            address2:string,
            city:string,
            province:string
        >,
        orders:array<
            struct<
                itemid:INT,
                orderdate:string
            >
        >
    >
)
STORED AS JSON
LOCATION 'oss://xxx/test/json/hcatalog_serde/table_1/';
```

Query the table:

```sql
select * from json_table_1;
+---------+----------------------------------------------------------------------------------------------------------------+
| docid   | user_1                                                                                                         |
+---------+----------------------------------------------------------------------------------------------------------------+
| Alibaba | [1234, bob1234, Bob, [969 Wenyi West St., null, Hangzhou, Zhejiang], [[6789, 11/11/2017], [4352, 12/12/2017]]] |
+---------+----------------------------------------------------------------------------------------------------------------+
```

For nested structures defined with struct, fields can be referenced hierarchically with "."; for arrays defined with array, elements can be referenced with "[subscript]" (note: array subscripts start at 1).

```sql
select DocId,
       User_1.Id,
       User_1.ShippingAddress.Address1,
       User_1.Orders[1].ItemId
from json_table_1
where User_1.Username = 'bob1234'
  and User_1.Orders[2].OrderDate = '12/12/2017';
+---------+------+--------------------+-------+
| DocId   | id   | address1           | _col3 |
+---------+------+--------------------+-------+
| Alibaba | 1234 | 969 Wenyi West St. | 6789  |
+---------+------+--------------------+-------+
```

Using JSON functions to process data

For example, the nested JSON value of "value_string" is stored as a string:

```json
{"data_key":"com.taobao.vipserver.domains.meta.biz.alibaba.com","ts":1524550275112,"value_string":"{\"appName\":\"\",\"apps\":[],\"checksum\":\"50fa0540b430904ee78dff07c7350e1c\",\"clusterMap\":{\"DEFAULT\":{\"defCkport\":80,\"defIPPort\":80,\"healthCheckTask\":null,\"healthChecker\":{\"checkCode\":200,\"curlHost\":\"\",\"curlPath\":\"/status.taobao\",\"type\":\"HTTP\"},\"name\":\"DEFAULT\",\"nodegroup\":\"\",\"sitegroup\":\"\",\"submask\":\"0.0.0.0/0\",\"syncConfig\":{\"appName\":\"trade-ma\",\"nodegroup\":\"tradema\",\"pubLevel\":\"publish\",\"role\":\"\",\"site\":\"\"},\"useIPPort4Check\":true}},\"disabledSites\":[],\"enableArmoryUnit\":false,\"enableClientBeat\":false,\"enableHealthCheck\":true,\"enabled\":true,\"envAndSites\":\"\",\"invalidThreshold\":0.6,\"ipDeleteTimeout\":1800000,\"lastModifiedMillis\":1524550275107,\"localSiteCall\":true,\"localSiteThreshold\":0.8,\"name\":\"biz.alibaba.com\",\"nodegroup\":\"\",\"owners\":[\"junlan.zx\",\"张三\",\"李四\",\"cui.yuanc\"],\"protectThreshold\":0,\"requireSameEnv\":false,\"resetWeight\":false,\"symmetricCallType\":null,\"symmetricType\":\"warehouse\",\"tagName\":\"ipGroup\",\"tenantId\":\"\",\"tenants\":[],\"token\":\"1cf0ec0c771321bb4177182757a67fb0\",\"useSpecifiedURL\":false}"}
```

Formatted with an online JSON formatter, the data looks like this:

```json
{
  "data_key": "com.taobao.vipserver.domains.meta.biz.alibaba.com",
  "ts": 1524550275112,
  "value_string": "{\"appName\":\"\",\"apps\":[],\"checksum\":\"50fa0540b430904ee78dff07c7350e1c\",\"clusterMap\":{\"DEFAULT\":{\"defCkport\":80,\"defIPPort\":80,\"healthCheckTask\":null,\"healthChecker\":{\"checkCode\":200,\"curlHost\":\"\",\"curlPath\":\"/status.taobao\",\"type\":\"HTTP\"},\"name\":\"DEFAULT\",\"nodegroup\":\"\",\"sitegroup\":\"\",\"submask\":\"0.0.0.0/0\",\"syncConfig\":{\"appName\":\"trade-ma\",\"nodegroup\":\"tradema\",\"pubLevel\":\"publish\",\"role\":\"\",\"site\":\"\"},\"useIPPort4Check\":true}},\"disabledSites\":[],\"enableArmoryUnit\":false,\"enableClientBeat\":false,\"enableHealthCheck\":true,\"enabled\":true,\"envAndSites\":\"\",\"invalidThreshold\":0.6,\"ipDeleteTimeout\":1800000,\"lastModifiedMillis\":1524550275107,\"localSiteCall\":true,\"localSiteThreshold\":0.8,\"name\":\"biz.alibaba.com\",\"nodegroup\":\"\",\"owners\":[\"junlan.zx\",\"张三\",\"李四\",\"cui.yuanc\"],\"protectThreshold\":0,\"requireSameEnv\":false,\"resetWeight\":false,\"symmetricCallType\":null,\"symmetricType\":\"warehouse\",\"tagName\":\"ipGroup\",\"tenantId\":\"\",\"tenants\":[],\"token\":\"1cf0ec0c771321bb4177182757a67fb0\",\"useSpecifiedURL\":false}"
}
```

The CREATE TABLE statement is:

```sql
CREATE EXTERNAL TABLE json_table_2 (
    data_key string,
    ts bigint,
    value_string string
)
STORED AS JSON
LOCATION 'oss://xxx/test/json/hcatalog_serde/table_2/';
```

After the table is created, it can be queried:

```sql
select * from json_table_2;
+---------------------------------------------------+---------------+--------------+
| data_key                                          | ts            | value_string |
+---------------------------------------------------+---------------+--------------+
| com.taobao.vipserver.domains.meta.biz.alibaba.com | 1524550275112 | {"appName":"","apps":[],"checksum":"50fa0540b430904ee78dff07c7350e1c","clusterMap":{"DEFAULT":{"defCkport":80,"defIPPort":80,"healthCheckTask":null,"healthChecker":{"checkCode":200,"curlHost":"","curlPath":"/status.taobao","type":"HTTP"},"name":"DEFAULT","nodegroup":"","sitegroup":"","submask":"0.0.0.0/0","syncConfig":{"appName":"trade-ma","nodegroup":"tradema","pubLevel":"publish","role":"","site":""},"useIPPort4Check":true}},"disabledSites":[],"enableArmoryUnit":false,"enableClientBeat":false,"enableHealthCheck":true,"enabled":true,"envAndSites":"","invalidThreshold":0.6,"ipDeleteTimeout":1800000,"lastModifiedMillis":1524550275107,"localSiteCall":true,"localSiteThreshold":0.8,"name":"biz.alibaba.com","nodegroup":"","owners":["junlan.zx","张三","李四","cui.yuanc"],"protectThreshold":0,"requireSameEnv":false,"resetWeight":false,"symmetricCallType":null,"symmetricType":"warehouse","tagName":"ipGroup","tenantId":"","tenants":[],"token":"1cf0ec0c771321bb4177182757a67fb0","useSpecifiedURL":false} |
+---------------------------------------------------+---------------+--------------+
```

The following SQL examples show how to use common JSON functions such as json_parse, json_extract_scalar, and json_extract:

```sql
mysql> select json_extract_scalar(json_parse(value), '$.owners[1]') from json_table_2;
+--------+
| _col0  |
+--------+
| 张三   |
+--------+

mysql> select json_extract_scalar(json_obj.json_col, '$.DEFAULT.submask')
from (
    select json_extract(json_parse(value), '$.clusterMap') as json_col from json_table_2
) json_obj
where json_extract_scalar(json_obj.json_col, '$.DEFAULT.healthChecker.curlPath') = '/status.taobao';
+-----------+
| _col0     |
+-----------+
| 0.0.0.0/0 |
+-----------+

mysql> with json_obj as (select json_extract(json_parse(value), '$.clusterMap') as json_col from json_table_2)
select json_extract_scalar(json_obj.json_col, '$.DEFAULT.submask')
from json_obj
where json_extract_scalar(json_obj.json_col, '$.DEFAULT.healthChecker.curlPath') = '/status.taobao';
+-----------+
| _col0     |
+-----------+
| 0.0.0.0/0 |
+-----------+
```
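The same extractions can be reproduced locally with Python's json module: json_parse corresponds to json.loads, and the JSON-path lookups become plain dict and list indexing (the sample value below is a trimmed-down, hypothetical version of value_string):

```python
import json

# value_string stores nested JSON as a string, so it must be parsed first,
# just as json_parse does in the SQL above.
value_string = ('{"owners": ["junlan.zx", "zhang.san"], '
                '"clusterMap": {"DEFAULT": {"submask": "0.0.0.0/0", '
                '"healthChecker": {"curlPath": "/status.taobao"}}}}')
value = json.loads(value_string)
print(value["owners"][1])                         # like '$.owners[1]'
print(value["clusterMap"]["DEFAULT"]["submask"])  # like '$.DEFAULT.submask'
```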

2.5 ORC files

Optimized Row Columnar (ORC) is an optimized columnar storage file format supported by the Apache Hive open-source project. Compared with CSV files, it not only saves storage space but also delivers better query performance.

For ORC files, simply specify STORED AS ORC when creating the table.

```sql
CREATE EXTERNAL TABLE orders_orc_date (
    O_ORDERKEY INT,
    O_CUSTKEY INT,
    O_ORDERSTATUS STRING,
    O_TOTALPRICE DOUBLE,
    O_ORDERDATE DATE,
    O_ORDERPRIORITY STRING,
    O_CLERK STRING,
    O_SHIPPRIORITY INT,
    O_COMMENT STRING
)
STORED AS ORC
LOCATION 'oss://bucket-for-testing/datasets/tpch/1x/orc_date/orders_orc';
```

2.6 PARQUET files

Parquet is an Apache open-source columnar storage file format from the Hadoop ecosystem.

When creating a table in DLA, simply specify STORED AS PARQUET.

```sql
CREATE EXTERNAL TABLE orders_parquet_date (
    O_ORDERKEY INT,
    O_CUSTKEY INT,
    O_ORDERSTATUS STRING,
    O_TOTALPRICE DOUBLE,
    O_ORDERDATE DATE,
    O_ORDERPRIORITY STRING,
    O_CLERK STRING,
    O_SHIPPRIORITY INT,
    O_COMMENT STRING
)
STORED AS PARQUET
LOCATION 'oss://bucket-for-testing/datasets/tpch/1x/parquet_date/orders_parquet';
```

2.7 RCFILE files

Record Columnar File (RCFile) is a columnar storage format that stores relational table structures efficiently in distributed systems and can be read and processed efficiently.

When creating the table in DLA, specify STORED AS RCFILE.

```sql
CREATE EXTERNAL TABLE lineitem_rcfile_date (
    L_ORDERKEY INT,
    L_PARTKEY INT,
    L_SUPPKEY INT,
    L_LINENUMBER INT,
    L_QUANTITY DOUBLE,
    L_EXTENDEDPRICE DOUBLE,
    L_DISCOUNT DOUBLE,
    L_TAX DOUBLE,
    L_RETURNFLAG STRING,
    L_LINESTATUS STRING,
    L_SHIPDATE DATE,
    L_COMMITDATE DATE,
    L_RECEIPTDATE DATE,
    L_SHIPINSTRUCT STRING,
    L_SHIPMODE STRING,
    L_COMMENT STRING
)
STORED AS RCFILE
LOCATION 'oss://bucket-for-testing/datasets/tpch/1x/rcfile_date/lineitem_rcfile';
```

2.8 AVRO files

When creating a table over AVRO files in DLA, specify STORED AS AVRO; the columns you define must match the schema of the AVRO files.

If you are unsure of the schema, you can obtain it with the tools that Avro provides and create the table accordingly.

Download avro-tools-1.8.2.jar from the Apache Avro website to your local machine, then run the following command to obtain the schema of an Avro file:

```
java -jar avro-tools-1.8.2.jar getschema /path/to/your/doctors.avro
{
  "type": "record",
  "name": "doctors",
  "namespace": "testing.hive.avro.serde",
  "fields": [{
      "name": "number",
      "type": "int",
      "doc": "Order of playing the role"
    }, {
      "name": "first_name",
      "type": "string",
      "doc": "first name of actor playing role"
    }, {
      "name": "last_name",
      "type": "string",
      "doc": "last name of actor playing role"
    }]
}
```

The CREATE TABLE statement is as follows, where the name of each field corresponds to a column name in the table, and the type must be converted to a Hive-supported type according to the conversion table in this document.

```sql
CREATE EXTERNAL TABLE doctors (
    number int,
    first_name string,
    last_name string
)
STORED AS AVRO
LOCATION 'oss://mybucket-for-testing/directory/to/doctors';
```
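For simple schemas, the output of getschema can be turned into column definitions mechanically. A minimal sketch, covering primitive types only (the AVRO_TO_HIVE mapping follows the conversion table in this document; nested record/map/array types would need recursive handling):

```python
import json

# Map primitive Avro types to Hive types, per the conversion table.
AVRO_TO_HIVE = {"null": "void", "boolean": "boolean", "int": "int",
                "long": "bigint", "float": "float", "double": "double",
                "bytes": "binary", "string": "string"}

def columns_from_schema(schema_json):
    """Render the column list of a CREATE TABLE from an Avro schema dump."""
    schema = json.loads(schema_json)
    return ",\n".join("    {} {}".format(f["name"], AVRO_TO_HIVE[f["type"]])
                      for f in schema["fields"])

schema = '''{"type": "record", "name": "doctors",
             "fields": [{"name": "number", "type": "int"},
                        {"name": "first_name", "type": "string"},
                        {"name": "last_name", "type": "string"}]}'''
print(columns_from_schema(schema))
```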

In most cases, Avro types convert directly to the corresponding Hive types. If a type is not supported in Hive, it is converted to a close one. See the following table:

| Avro type | Hive type |
|-----------|-----------|
| null      | void      |
| boolean   | boolean   |
| int       | int       |
| long      | bigint    |
| float     | float     |
| double    | double    |
| bytes     | binary    |
| string    | string    |
| record    | struct    |
| map       | map       |
| list      | array     |
| union     | union     |
| enum      | string    |
| fixed     | binary    |

2.9 Files that can be matched with regular expressions

Files of this type are usually stored on OSS as plain text; each line represents one record in the table and can be matched with a regular expression.

For example, Apache web server log files are of this type.

```
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
127.0.0.1 - - [26/May/2009:00:00:00 +0000] "GET /someurl/?track=Blabla(Main) HTTP/1.1" 200 5864 - "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/525.19 (KHTML, like Gecko) Chrome/1.0.154.65 Safari/525.19"
```

Each line can be expressed with the following regular expression, with columns separated by spaces:

```
([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?
```

For this file format, the CREATE TABLE statement can be written as:

```sql
CREATE EXTERNAL TABLE serde_regex (
    host STRING,
    identity STRING,
    userName STRING,
    time STRING,
    request STRING,
    status STRING,
    size INT,
    referer STRING,
    agent STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
    "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?"
)
STORED AS TEXTFILE
LOCATION 'oss://bucket-for-testing/datasets/serde/regex';
```
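The pattern can be checked locally with Python's re module before creating the table (note that the doubled backslashes of the SQL string literal become single backslashes in a raw Python string):

```python
import re

# The RegexSerDe pattern applied to the first sample log line; each capture
# group becomes one table column, and unmatched optional groups become NULL.
pattern = (r'([^ ]*) ([^ ]*) ([^ ]*) (-|\[[^\]]*\]) ([^ "]*|"[^"]*") '
           r'(-|[0-9]*) (-|[0-9]*)(?: ([^ "]*|"[^"]*") ([^ "]*|"[^"]*"))?')
line = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
        '"GET /apache_pb.gif HTTP/1.0" 200 2326')
m = re.match(pattern, line)
print(m.groups())
```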

Query results:

```sql
mysql> select * from serde_regex;
+-----------+----------+----------+------------------------------+---------------------------------------------+--------+------+---------+--------------------------------------------------------------------------------------------------------------------------+
| host      | identity | userName | time                         | request                                     | status | size | referer | agent                                                                                                                    |
+-----------+----------+----------+------------------------------+---------------------------------------------+--------+------+---------+--------------------------------------------------------------------------------------------------------------------------+
| 127.0.0.1 | -        | frank    | [10/Oct/2000:13:55:36 -0700] | "GET /apache_pb.gif HTTP/1.0"               | 200    | 2326 | NULL    | NULL                                                                                                                     |
| 127.0.0.1 | -        | -        | [26/May/2009:00:00:00 +0000] | "GET /someurl/?track=Blabla(Main) HTTP/1.1" | 200    | 5864 | -       | "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/525.19 (KHTML, like Gecko) Chrome/1.0.154.65 Safari/525.19" |
+-----------+----------+----------+------------------------------+---------------------------------------------+--------+------+---------+--------------------------------------------------------------------------------------------------------------------------+
```

2.10 Esri ArcGIS geographic JSON data files

DLA supports SerDe processing of Esri ArcGIS geographic JSON data files. For a description of this geographic JSON format, see https://github.com/Esri/spatial-framework-for-hadoop/wiki/JSON-Formats.

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS california_counties (
    Name string,
    BoundaryShape binary
)
ROW FORMAT SERDE 'com.esri.hadoop.hive.serde.JsonSerde'
STORED AS INPUTFORMAT 'com.esri.json.hadoop.EnclosedJsonInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'oss://test_bucket/datasets/geospatial/california-counties/';
```

3. Summary

As the examples above show, DLA supports files in most open-source storage formats. For the same data, different storage formats lead to considerable differences in file size on OSS and in DLA query speed. ORC is the recommended format for storing and querying files.

DLA is being continuously optimized for faster queries and will support more data sources in the future, bringing users a better big-data analysis experience.
