The requests module reports a different encoding than the one actually set in the HTML page
Code:

    import requests

    url = "http://www.reynamining.com/nuevositio/contacto.html"
    obj = requests.get(url, timeout=60, verify=False, allow_redirects=True)
    print(obj.encoding)

Output:

    ISO-8859-1
whereas the actual encoding set in the HTML is UTF-8:

    content="text/html; charset=utf-8"
My questions are:
- Why is requests.encoding showing a different encoding than the one described in the HTML page?
I am trying to convert the encoding to UTF-8 using objReq.content.decode(encodes).encode("utf-8"). Since the content is already in UTF-8, when I decode it as ISO-8859-1 and encode it as UTF-8 the values change, e.g. á changes to Ã.
- Is there a way to convert all types of encodings to UTF-8?
Requests sets the response.encoding attribute to ISO-8859-1 when you have a text/* response and no charset has been specified in the response headers.
See the Encodings section of the Advanced documentation:

"The only time Requests will not do this is if no explicit charset is present in the HTTP headers and the Content-Type header contains text. In this situation, RFC 2616 specifies that the default charset must be ISO-8859-1. Requests follows the specification in this case. If you require a different encoding, you can manually set the Response.encoding property, or use the raw Response.content."

Bold emphasis mine.
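To see why the characters in the question change, here is a minimal sketch that needs no network access. It assumes the server sent UTF-8 bytes and that the body was then decoded with the ISO-8859-1 fallback; the variable names are illustrative only.

```python
# Bytes as a server would send them for "á" in a UTF-8 page.
raw = "á".encode("utf-8")  # b'\xc3\xa1'

# Decoding with the ISO-8859-1 fallback turns each byte into its own character.
wrong = raw.decode("iso-8859-1")

# Re-encoding that string as UTF-8 bakes the damage in.
double_encoded = wrong.encode("utf-8")

# Decoding with the codec the page actually declares gives the original text back.
right = raw.decode("utf-8")

print(repr(wrong), repr(right))
```

With requests, the equivalent fix is to set resp.encoding = 'utf-8' before reading resp.text, or to decode resp.content yourself.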
You can test for this by looking for a charset parameter in the Content-Type header:

    resp = requests.get(....)
    encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
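A substring check is usually enough, but if you want the charset value itself, the header can be parsed instead. The following is a sketch using only the standard library; header_charset is a hypothetical helper name, not part of requests.

```python
from email.message import Message


def header_charset(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type or ""
    # get_content_charset() lowercases the value and returns None when
    # the header carries no charset parameter.
    return msg.get_content_charset()


print(header_charset("text/html; charset=UTF-8"))
print(header_charset("text/html"))
```

You would call it as header_charset(resp.headers.get('content-type')) and only trust resp.encoding when the result is not None.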
Your HTML document specifies the content type in a <meta> header, and it is this header that is authoritative:

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
HTML 5 also defines a <meta charset="..." /> tag, see <meta charset="utf-8"> vs <meta http-equiv="Content-Type">.
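Both <meta> forms can be read with the standard library alone. The sketch below is a simplification and not the full encoding-detection algorithm browsers use; MetaCharsetParser and meta_charset are hypothetical names introduced for illustration.

```python
from email.message import Message
from html.parser import HTMLParser


class MetaCharsetParser(HTMLParser):
    """Record the first charset declared in a <meta> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attrs = dict(attrs)  # attribute names arrive lowercased
        if attrs.get("charset"):
            # HTML 5 form: <meta charset="...">
            self.charset = attrs["charset"].strip().lower()
        elif (attrs.get("http-equiv") or "").lower() == "content-type":
            # Older form: <meta http-equiv="Content-Type" content="...">
            msg = Message()
            msg["Content-Type"] = attrs.get("content") or ""
            self.charset = msg.get_content_charset()


def meta_charset(html_bytes):
    """Return the charset declared in the document's <meta> tags, or None."""
    parser = MetaCharsetParser()
    # Charset names and tag syntax are ASCII, so a lossless ISO-8859-1
    # decode is enough to locate the declaration.
    parser.feed(html_bytes.decode("iso-8859-1"))
    return parser.charset
```

Passing resp.content to meta_charset gives you a codec name you can assign to resp.encoding when the HTTP header carries no charset.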
You should not recode HTML pages to UTF-8 if they contain such a header with a different codec. You must at least correct that header in that case.
Using BeautifulSoup:

    from bs4 import BeautifulSoup

    # pass in an explicit encoding only if one was set in the header
    encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
    content = resp.content
    soup = BeautifulSoup(content, from_encoding=encoding)
    if soup.original_encoding != 'utf-8':
        meta = soup.select_one('meta[charset], meta[http-equiv="Content-Type"]')
        if meta:
            # replace the meta charset info before re-encoding
            if 'charset' in meta.attrs:
                meta['charset'] = 'utf-8'
            else:
                meta['content'] = 'text/html; charset=utf-8'
        # re-encode to UTF-8
        content = soup.prettify(encoding='utf-8')  # returns UTF-8 encoded bytes
Similarly, other document standards may also specify specific encodings; XML, for example, is always UTF-8 unless specified otherwise by a <?xml encoding="..." ... ?> XML declaration, which is again part of the document.
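The XML case can be checked with the standard library parser. This is a small sketch with made-up sample documents: one has no declaration and is read as UTF-8, the other declares ISO-8859-1 and is decoded accordingly.

```python
import xml.etree.ElementTree as ET

# No encoding declaration: the parser treats the bytes as UTF-8.
utf8_doc = "<p>á</p>".encode("utf-8")

# Declared encoding: the XML declaration tells the parser how to decode.
latin1_doc = '<?xml version="1.0" encoding="ISO-8859-1"?><p>á</p>'.encode("iso-8859-1")

utf8_text = ET.fromstring(utf8_doc).text
latin1_text = ET.fromstring(latin1_doc).text

# Both documents yield the same text despite different byte encodings.
print(utf8_text == latin1_text)
```

In both cases the parser is handed bytes, not a decoded string, so the document's own declaration decides the codec.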