python 3.x - Impossible to remove \n and \t in python3 string?
So I've been trying to format text taken from a Craigslist web page so I can send it to my email, but this is what I come up with every time I try to remove the \n and \t:
b'\n\n\n\t\n\t\n\t\n\t\n\t\n\t\n\n\n\n\t\n\n\n\t \n\t\t\t \n\t \n\t\t \n\t\t\t \n 0 favorites\n \n\n\t\t \n\t\t ∨ \n\t\t ∧ \n\t\t \n \n \n \n\t \tcl wenatchee personals casual encounters\n \n \n\t\t \n\t \n \n\n\t\t \n\t\t\t \n\t\n\t\t\n\t\n\n\n\nreply to: 59nv6-4031116628@pers.craigslist.org\n \n\n\n\t \n\t\n\t\tflag [?] :\n\t\t\n\t\t\tmiscategorized\n\t\t\n\t\t\tprohibited\n\t\t\n\t\t\tspam\n\t\t\n\t\t\tbest of\n\t\n \n\n\t\t posted: 2013-08-28, 8:23am pdt \n \n\n \n \n well... - w4m - 22 (wenatchee)\n
I have tried strip, replace, and even regex, but nothing fazes it. It always comes through in my email unaffected by everything.
Here's my code:
try:
    if url.find('http://') == -1:
        url = 'http://wenatchee.craigslist.org' + url
    html = urlopen(url).read()
    html = str(html)
    html = re.sub(r'\s+', ' ', html)
    print(html)
    part2 = MIMEText(html, 'html')
    msg.attach(part2)
    s = smtplib.SMTP('localhost')
    s.sendmail(me, you, msg.as_string())
    s.quit()
except Exception as e:
    print(e)
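Your issue is that, despite all evidence to the contrary, you still have a bytes object rather than the str you're hoping for. Calling str(html) on bytes does not decode it. It returns the printable representation, b'...', in which every newline is the two characters backslash and n. Your attempts come to nothing because, without an encoding specified, there's no real whitespace in your html string for anything (regexes, replacement parameters, etc.) to match.
What you need to do is decode the bytes first.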
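To see the difference without hitting the network, here is a small sketch. The `raw` value is a made-up stand-in for whatever `urlopen(url).read()` returns, and `collapse` is just a hypothetical helper name wrapping the same `re.sub` call from the question.

```python
import re

# Hypothetical sample standing in for the bytes returned by urlopen(url).read()
raw = b"\n\n\t reply to:\n\t\tflag [?] :\n"

as_repr = str(raw)             # printable repr: starts with b' and holds backslash + n pairs
as_text = raw.decode("utf-8")  # a real str containing real newline and tab characters


def collapse(text):
    """Replace each run of whitespace with a single space."""
    return re.sub(r"\s+", " ", text)


print(collapse(as_repr))  # unchanged: the only real whitespace is single spaces
print(collapse(as_text))  # newlines and tabs are collapsed
```

The first call has nothing to do because the string holds escape sequences spelled out as text, which is why strip, replace, and the regex all seemed to be ignored.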
And personally, my favorite method for cleaning up whitespace is to use str.split and str.join. Here's a working example. It removes any runs of any kind of whitespace and replaces them with single spaces.
from urllib.request import urlopen

try:
    html = urlopen('http://wenatchee.craigslist.org').read()
    html = html.decode("utf-8")  # decode the bytes into a useful string
    # Split the string on whitespace, then join it back together with single spaces.
    html = ' '.join(html.split())
    print(html)
except Exception as e:
    print(e)
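If you want to try the split/join approach offline, something like the following should do. The function name `normalize_whitespace` and the sample bytes are my own for illustration, and `errors="replace"` is an assumption on my part so that a stray non-UTF-8 byte doesn't raise. Drop it if you would rather see the error.

```python
def normalize_whitespace(raw, encoding="utf-8"):
    """Decode bytes, then collapse every run of whitespace to one space."""
    # errors="replace" swaps undecodable bytes for U+FFFD instead of raising
    text = raw.decode(encoding, errors="replace")
    # split() with no arguments splits on any whitespace and drops empty pieces,
    # so leading and trailing whitespace disappears as well
    return " ".join(text.split())


# Hypothetical sample shaped like the page dump in the question
sample = b"\n\t\t posted: 2013-08-28, 8:23am pdt \n \n\n well... - w4m - 22 (wenatchee)\n"
print(normalize_whitespace(sample))
```

Unlike the regex version, this also trims the ends of the string, so there is no need for a separate strip call.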