hadoop - Unstructured data into structured data using Pig -


I'm trying to structure unstructured data using Pig so that I can do some processing on it.

Here's a sample of the data:

nov 1   18:23:34    dev_id=03   user_id=000 int_ip=198.0.13.24  ext_ip=68.67.0.14   src_port=99 dest_port=213   response_code=5 

Expected output:

nov 1 18:23:34, 03 , 000, 198.0.13.24, 68.67.0.14, 99, 213, 5 

As you can see, the data is not cleanly separated (by a single tab or comma). I tried loading the data using '\t' as the delimiter and dumped it on the terminal.

a = load '----' using PigStorage('\t') as (mnth: chararray, day: int, --------);
dump a;
store a into '\root\output';

Output:

Dump output:

(nov,1,18:23:34,dev_id=03,user_id=000,int_ip=198.0.13.24,ext_ip=68.67.0.14,src_port=99,dest_port=213,response_code=5) 

Store output: the results are stored in the same form as the input, not like the dump (comma separated).

nov 1   18:23:34    dev_id=03   user_id=000 int_ip=198.0.13.24  ext_ip=68.67.0.14   src_port=99 dest_port=213   response_code=5 

Alternative: I also tried loading the data using datastorage() as (value: varchar) and performed tokenize as well, but I was not able to achieve the objective.

I need a few more suggestions:

  1. I have stored 3 fields: month: "nov", day: "1", and time: "18:23:34". Is it possible to join the 3 fields into one, as time: "nov 1 18:23:34"?

  2. All the data is stored together with its key, such as dev_id=03, user_id=000. I need to remove the key and store only the values: 03, 000, 198.0.13.24, etc.

Is it possible to do this processing using Pig, or do I need to write a MapReduce program?

Edit 1:

After getting the comment, I tried regex_extract on a single column and it works fine. For multiple columns, I tried regex_extract_all as follows:

a = load '----' using PigStorage('\t') as (mnth: chararray, day: int, dev: chararray, user: chararray --------);
b = foreach a generate regex_extract_all(devid, userid, '(^.*=(.*)$) (^.*=(.*)$)');
dump b;

I got an error:

Error: ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1070: Could not resolve regex_extract_all using imports.

Can I extract multiple fields using regex_extract_all?

Just write a custom loader for your data, and all of these issues can be solved in Java. An example of doing this step by step can be found here.
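
If a custom loader feels like too much, the same result may be reachable in plain Pig. The sketch below is only one possible approach, under several assumptions: the Pig version ships the built-in REGEX_EXTRACT_ALL, every line has the fields in the same order as the sample, and the fields are separated by runs of spaces or tabs. The paths 'input.log' and 'output' and all alias and field names are placeholders, not taken from the question. Two details are worth noting. Built-in function names are case-sensitive and are written in uppercase, and REGEX_EXTRACT_ALL takes exactly two arguments (a string and a regex whose groups become the tuple fields), so it has to be applied to the whole line rather than to two columns. Also, STORE without a USING clause writes tab-separated output by default, which is why the stored file looked like the input; passing PigStorage(',') gives comma-separated output.

```pig
-- Read each raw line as a single string; 'input.log' is a placeholder path.
raw = LOAD 'input.log' USING TextLoader() AS (line:chararray);

-- One capture group per wanted value; the key= prefixes sit outside the groups,
-- so only the values are kept. Backslashes are doubled inside the Pig string.
parsed = FOREACH raw GENERATE FLATTEN(
    REGEX_EXTRACT_ALL(line,
        '^(\\S+)\\s+(\\d+)\\s+(\\S+)\\s+dev_id=(\\S+)\\s+user_id=(\\S+)\\s+int_ip=(\\S+)\\s+ext_ip=(\\S+)\\s+src_port=(\\S+)\\s+dest_port=(\\S+)\\s+response_code=(\\S+)\\s*$')
) AS (mnth:chararray, day:chararray, tm:chararray,
      dev_id:chararray, user_id:chararray,
      int_ip:chararray, ext_ip:chararray,
      src_port:chararray, dest_port:chararray,
      response_code:chararray);

-- Drop lines that did not match the pattern (their fields come back null).
matched = FILTER parsed BY mnth IS NOT NULL;

-- Join month, day and time into one field with nested two-argument CONCAT calls.
result = FOREACH matched GENERATE
    CONCAT(mnth, CONCAT(' ', CONCAT(day, CONCAT(' ', tm)))) AS ts,
    dev_id, user_id, int_ip, ext_ip, src_port, dest_port, response_code;

-- Write comma-separated output; 'output' is a placeholder directory.
STORE result INTO 'output' USING PigStorage(',');
```

Matching the whole line in one pattern keeps the script short, but it is brittle if fields can be missing or reordered; in that case the custom loader suggested above, or a per-field REGEX_EXTRACT, is likely the safer route.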

