python - Reading UTF8 encoded CSV and converting to UTF-16 -

- February 15, 2010

I'm reading in a CSV file that has UTF-8 encoding:

    import csv

    ifile = open(fname, "r")
    for row in csv.reader(ifile):
        name = row[0]
        print repr(row[0])

This works fine and prints out what I expect it to print out: a UTF-8 encoded str:

> '\xc3\x81lvaro salazar'
> '\xc3\x89lodie yung'
> ...

Furthermore, when I print the str (as opposed to the repr()), the output displays OK (which I don't understand either way - shouldn't this cause an error?):

> Álvaro salazar
> Élodie yung

But when I try to convert my UTF-8 encoded strs to unicode:

    ifile = open(fname, "r")
    for row in csv.reader(ifile):
        name = row[0]
        print unicode(name, 'utf-8')  # or name.decode('utf-8')

I get the infamous:

    Traceback (most recent call last):
      File "scripts/script.py", line 33, in <module>
        print unicode(fullname, 'utf-8')
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xc1' in position 0: ordinal not in range(128)

So I looked at the unicode strings that are created:

    ifile = open(fname, "r")
    for row in csv.reader(ifile):
        name = row[0]
        unicode_name = unicode(name, 'utf-8')
        print repr(unicode_name)

and the output is:

> u'\xc1lvaro salazar'
> u'\xc9lodie yung'

So now I'm totally confused, as these seem to be mangled hex values. I've read this question:

Reading a UTF8 CSV file with Python

and it appears I am doing everything correctly, leading me to believe that my file is not actually UTF-8. But when I initially print out the repr values of the cells, they appear to be correct UTF-8 hex values. Can anyone either point out my problem or indicate where my understanding is breaking down (as I'm starting to get lost in the jungle of encodings)?

As an aside, I believe I could use codecs to open the file and read directly into unicode objects, but the csv module doesn't support unicode natively, so I can't use this approach.
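For the conversion named in the title, here is a rough sketch that assumes Python 3, where the csv module works on text and the limitation above no longer applies. The function name `convert_csv_utf8_to_utf16` and its path arguments are hypothetical, not taken from the question's script.

```python
import csv
import io


def convert_csv_utf8_to_utf16(src_path, dst_path):
    """Read a UTF-8 encoded CSV file and write it back out as UTF-16.

    Returns the number of rows copied.
    """
    rows_copied = 0
    # newline='' lets the csv module handle line endings itself.
    with io.open(src_path, 'r', encoding='utf-8', newline='') as src, \
            io.open(dst_path, 'w', encoding='utf-16', newline='') as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            # Each cell is already a decoded text string at this point.
            writer.writerow(row)
            rows_copied += 1
    return rows_copied
```

On Python 2 the usual pattern is the one in the question: let csv parse the bytestrings, then decode each cell with `.decode('utf-8')`.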

Your default encoding is ASCII. When you try to print a unicode object, the interpreter therefore tries to encode it using the ASCII codec, which fails because your text includes characters that don't exist in ASCII.
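The same failure can be reproduced without a terminal by encoding explicitly. This is a minimal sketch in Python 3 syntax (where the unicode type is called str and bytestrings are bytes); `try_encode` is a hypothetical helper used only for illustration.

```python
name = u'\xc1lvaro salazar'  # the unicode value from the question


def try_encode(text, encoding):
    """Return the encoded bytes, or the error message if encoding fails."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError as exc:
        return str(exc)


# ASCII has no representation for U+00C1, so this yields the error message.
ascii_result = try_encode(name, 'ascii')
# UTF-8 can represent every code point, so this yields bytes.
utf8_result = try_encode(name, 'utf-8')

print(ascii_result)
print(repr(utf8_result))
```

The ASCII attempt fails at position 0, exactly as in the traceback, while the UTF-8 attempt gives back the bytes that were read from the file.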

The reason that printing the UTF-8 encoded bytestring doesn't produce an error (which seems to confuse you, although it shouldn't) is that this simply sends the bytes to your terminal. It will never produce a Python error, although it may produce ugly output if your terminal doesn't know what to do with the bytes.

To print a unicode object, use print some_unicode.encode('utf-8') (or whatever encoding your terminal is actually using).
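One way to pick "whatever encoding your terminal is using" is to ask `sys.stdout.encoding` and fall back to UTF-8 when it is not set. The sketch below assumes that fallback is acceptable; `encode_for_terminal` and its parameters are hypothetical names.

```python
import sys


def encode_for_terminal(text, encoding=None, fallback='utf-8'):
    """Encode text for output, replacing characters the target encoding lacks."""
    # sys.stdout.encoding can be None, for example when output is piped.
    target = encoding or getattr(sys.stdout, 'encoding', None) or fallback
    # 'replace' substitutes '?' instead of raising UnicodeEncodeError.
    return text.encode(target, 'replace')


name = u'\xc1lvaro salazar'
as_utf8 = encode_for_terminal(name, encoding='utf-8')
as_ascii = encode_for_terminal(name, encoding='ascii')
```

With the 'replace' error handler an ASCII-only terminal gets a `?` in place of Á rather than a traceback. Whether that trade-off is acceptable depends on the script.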

As for u'\xc1lvaro salazar', nothing here is mangled. The character Á is at the unicode codepoint C1 (which has nothing to do with its UTF-8 representation, but happens to be the same value as in Latin-1), and Python uses \x hex escapes instead of \u unicode codepoint notation for codepoints that would have 00 as the most significant byte, to save space (it could also have displayed this as \u00c1).
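The relationship between the codepoint and its encodings can be checked directly. This is a small illustrative sketch in Python 3 syntax; the variable names are made up for the example.

```python
ch = u'\xc1'  # same character as u'\u00c1', just a shorter escape

codepoint = ord(ch)                  # the unicode codepoint, 0xC1
utf8_bytes = ch.encode('utf-8')      # two bytes: b'\xc3\x81'
latin1_bytes = ch.encode('latin-1')  # one byte: b'\xc1'

# Both escape notations denote the same one-character string.
same_character = (ch == u'\u00c1')

# Decoding the bytes from the CSV file gives back the codepoint C1.
round_trip = b'\xc3\x81'.decode('utf-8')

print(hex(codepoint), repr(utf8_bytes), repr(latin1_bytes))
```

So `'\xc3\x81'` in the first output is the UTF-8 encoding and `u'\xc1'` in the second is the decoded codepoint. Both describe the same Á.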

To get an overview of how unicode works in Python, I suggest http://nedbatchelder.com/text/unipain.html
