python - Reading UTF8 encoded CSV and converting to UTF-16 -
I'm reading in a CSV file that has UTF-8 encoding:
    import csv

    ifile = open(fname, "r")
    for row in csv.reader(ifile):
        name = row[0]
        print repr(row[0])
This works fine and prints what I expect, a UTF-8 encoded str:
> '\xc3\x81lvaro salazar'
> '\xc3\x89lodie yung'
> ...
Furthermore, when I simply print the str (as opposed to its repr()), the output displays OK (which I don't understand either; shouldn't this cause an error?):
> Álvaro salazar
> Élodie yung
But when I try to convert my UTF-8 encoded strs to unicode:
    ifile = open(fname, "r")
    for row in csv.reader(ifile):
        name = row[0]
        print unicode(name, 'utf-8')  # or name.decode('utf-8')
I get the infamous:
    Traceback (most recent call last):
      File "scripts/script.py", line 33, in <module>
        print unicode(fullname, 'utf-8')
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xc1' in position 0: ordinal not in range(128)
So I looked at the unicode strings that are created:
    ifile = open(fname, "r")
    for row in csv.reader(ifile):
        name = row[0]
        unicode_name = unicode(name, 'utf-8')
        print repr(unicode_name)
and the output is:
> u'\xc1lvaro salazar'
> u'\xc9lodie yung'
So now I'm totally confused, as these seem to be mangled hex values. I've read this question:
and it appears that I am doing it correctly, which leads me to believe that my file is not actually UTF-8. But when I print out the repr values of the cells, they appear to be correct UTF-8 hex values. Can anyone either point out my problem or indicate where my understanding is breaking down (as I'm starting to get lost in the jungle of encodings)?
As an aside, I believe I could use codecs to open the file and read directly into unicode objects, but the csv module doesn't support unicode natively, so I can't use that approach.
Your default encoding is ASCII. When you try to print a unicode object, the interpreter therefore tries to encode it using the ASCII codec, which fails because your text includes characters that don't exist in ASCII.
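A minimal sketch of that failure, using one of the names from the question as a byte string. The variable names are illustrative, and the explicit `.encode('ascii')` call is only a stand-in for the implicit encoding step that print performs when the default encoding is ASCII. It is written so that it should behave the same way on Python 2 and Python 3.

```python
# UTF-8 bytes for one of the names in the question.
raw = b'\xc3\x81lvaro salazar'

# Decoding succeeds, because the bytes are valid UTF-8.
text = raw.decode('utf-8')

# Encoding with the ASCII codec fails on the very first character.
try:
    text.encode('ascii')
    error = None
except UnicodeEncodeError as exc:
    error = exc

print(repr(error))
```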
The reason that printing the UTF-8 encoded bytestring doesn't produce an error (which seems to confuse you, although it shouldn't) is that this simply sends the bytes to your terminal. It will never produce a Python error, although it may produce ugly output if your terminal doesn't know what to do with the bytes.
To print a unicode object, use print some_unicode.encode('utf-8') (or whatever encoding your terminal is actually using).
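As a rough sketch of that advice, the helper below encodes text explicitly before it is written out. `encode_for_terminal` is a hypothetical name, not part of any library, and falling back to UTF-8 when the terminal encoding is unknown is an assumption that may not suit every setup.

```python
import sys


def encode_for_terminal(text, encoding=None):
    """Hypothetical helper: encode text explicitly before printing it."""
    if encoding is None:
        # sys.stdout.encoding can be None, e.g. when output is piped;
        # falling back to UTF-8 here is an assumption.
        encoding = getattr(sys.stdout, 'encoding', None) or 'utf-8'
    return text.encode(encoding)


name = b'\xc3\x89lodie yung'.decode('utf-8')
encoded = encode_for_terminal(name, 'utf-8')
print(repr(encoded))
```

In the Python 2 code from the question, this would be used as `print encode_for_terminal(unicode_name)`. If the detected terminal encoding cannot represent a character, the call still raises UnicodeEncodeError.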
As for u'\xc1lvaro salazar', nothing here is mangled. The character Á is at the unicode codepoint C1 (which has nothing to do with its UTF-8 representation, but happens to be the same value as in Latin-1), and Python uses \x hex escapes instead of \u unicode codepoint notation for codepoints that would have 00 as their most significant byte, to save space (it could also have displayed this as \u00c1).
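The relationship between the codepoint and its byte encodings can be checked directly. This is a small sketch with illustrative variable names, intended to run on both Python 2 and Python 3.

```python
# The same one-character string, written with both escape notations.
a_acute = u'\xc1'
same_char = u'\u00c1'

# The codepoint is 0xC1 whichever escape is used.
codepoint = ord(a_acute)

# Latin-1 stores the codepoint as a single byte with the same value.
latin1_bytes = a_acute.encode('latin-1')

# UTF-8 needs two bytes, the ones shown by repr() in the question.
utf8_bytes = a_acute.encode('utf-8')

print(a_acute == same_char)
print(hex(codepoint))
```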
To get an overview of how Unicode works in Python, I suggest http://nedbatchelder.com/text/unipain.html