hadoop - Unstructured data into structured data using Pig
I'm trying to structure unstructured data using Pig so that I can do some processing on it.
Here's a sample of the data:
nov 1 18:23:34 dev_id=03 user_id=000 int_ip=198.0.13.24 ext_ip=68.67.0.14 src_port=99 dest_port=213 response_code=5
Expected output:
nov 1 18:23:34, 03 , 000, 198.0.13.24, 68.67.0.14, 99, 213, 5
As you can see, the data is not separated by a proper delimiter (like a tab or comma). I tried to load the data using '\t' and dumped it on the terminal.
a = LOAD '----' USING PigStorage('\t') AS (mnth: chararray, day: int, --------);
DUMP a;
STORE a INTO '\root\output';
Output:
Dump output:
(nov,1,18:23:34,dev_id=03,user_id=000,int_ip=198.0.13.24,ext_ip=68.67.0.14,src_port=99,dest_port=213,response_code=5)
Store output: the results are stored in the same form as the input, not like the dump (comma separated).
nov 1 18:23:34 dev_id=03 user_id=000 int_ip=198.0.13.24 ext_ip=68.67.0.14 src_port=99 dest_port=213 response_code=5
Alternative: I also tried to load the data using datastorage() as (value: varchar) and performed a tokenize as well, but I was not able to achieve my objective.
A few more suggestions I need:
I have stored three fields: month: "nov", day: "1", and time: "18:23:34". Is it possible to join the three fields into one, as time: "nov 1 18:23:34"?
All the data is stored together with its key, as in dev_id=03, user_id=000. I need to remove the key and store only the value, as in 03, 000, 198.0.13.24, etc.
Is it possible to do this processing using Pig, or do I need to write a MapReduce program?
Edit 1:
After getting a comment, I tried REGEX_EXTRACT on a single column and it works fine. For multiple columns, I tried REGEX_EXTRACT_ALL as follows:
a = LOAD '----' USING PigStorage('\t') AS (mnth: chararray, day: int, dev: chararray, user: chararray --------);
b = FOREACH a GENERATE REGEX_EXTRACT_ALL(devid, userid, '(^.*=(.*)$) (^.*=(.*)$)');
DUMP b;
I got this error:
Error: ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1070: Could not resolve REGEX_EXTRACT_ALL using imports.
Can I extract multiple fields using REGEX_EXTRACT_ALL?
Just write a custom loader for the data, and all of these issues can be solved in Java. An example of doing this step by step can be found here.
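To give a rough idea of what the Java side could look like, here is a minimal sketch of only the line-parsing step such a loader would need. It is not the linked example: the class name LogLineParser and its methods are hypothetical, and the Pig wiring (a class extending LoadFunc that calls this from getNext()) is left out so the snippet runs on its own. It assumes every line starts with a month, a day and an HH:mm:ss time, followed by whitespace-separated key=value tokens, as in the sample above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: the parsing logic a custom loader might delegate to.
public class LogLineParser {
    // Leading timestamp: month name, day of month, HH:mm:ss (space or tab separated).
    private static final Pattern TIMESTAMP =
        Pattern.compile("^(\\S+)\\s+(\\d+)\\s+(\\d{2}:\\d{2}:\\d{2})\\s*");
    // Any key=value token; only the value part is captured.
    private static final Pattern KEY_VALUE = Pattern.compile("\\S+=(\\S*)");

    // Returns the timestamp as one field, followed by the bare values in input order.
    public static List<String> parse(String line) {
        List<String> fields = new ArrayList<>();
        int rest = 0;
        Matcher ts = TIMESTAMP.matcher(line);
        if (ts.find()) {
            // Join month, day and time into a single field.
            fields.add(ts.group(1) + " " + ts.group(2) + " " + ts.group(3));
            rest = ts.end();
        }
        Matcher kv = KEY_VALUE.matcher(line);
        kv.region(rest, line.length());
        while (kv.find()) {
            fields.add(kv.group(1)); // drop the "key=" prefix
        }
        return fields;
    }

    // Convenience: the parsed fields as one comma-separated line.
    public static String toCsv(String line) {
        return String.join(",", parse(line));
    }

    public static void main(String[] args) {
        String sample = "nov 1 18:23:34 dev_id=03 user_id=000 int_ip=198.0.13.24 "
            + "ext_ip=68.67.0.14 src_port=99 dest_port=213 response_code=5";
        System.out.println(toCsv(sample));
        // prints: nov 1 18:23:34,03,000,198.0.13.24,68.67.0.14,99,213,5
    }
}
```

This covers both of the extra requests in the question: the three timestamp pieces come back as one field, and the key names are dropped so only the values remain. Inside a real loader, each parsed field would be put into a Pig tuple instead of being joined into a string.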