python - Requests module encoding provides a different encoding than the HTML encoding -


The requests module's encoding attribute reports a different encoding than the one actually set in the HTML page.

code:

import requests

url = "http://www.reynamining.com/nuevositio/contacto.html"
obj = requests.get(url, timeout=60, verify=False, allow_redirects=True)
print(obj.encoding)

output:

ISO-8859-1

whereas the actual encoding set in the HTML page is UTF-8: content="text/html; charset=UTF-8"

My questions are:

  1. Why does requests.encoding show a different encoding than the one declared in the HTML page?

I am trying to convert the encoding to UTF-8 using objreq.content.decode(encodes).encode("utf-8"), but since the page is already in UTF-8, decoding it as ISO-8859-1 and re-encoding it as UTF-8 changes the values, i.e. á changes to Ã¡.
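For example, this appears to be what happens to the bytes (a standalone sketch with a literal string, not my real page content):

# "á" encoded as UTF-8 is the two bytes 0xC3 0xA1
raw = "á".encode("utf-8")

# decoding those bytes as ISO-8859-1 treats each byte as a separate character,
# so the one character becomes the two characters "Ã¡" (mojibake)
wrong = raw.decode("iso-8859-1")
print(wrong)                   # Ã¡

# re-encoding the mis-decoded text as UTF-8 keeps the mojibake
print(wrong.encode("utf-8"))   # b'\xc3\x83\xc2\xa1'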

Is there a way to convert these kinds of encodings to UTF-8?

requests sets the response.encoding attribute to ISO-8859-1 when you have a text/* response and no charset has been specified in the response headers.

See the Encoding section of the Advanced documentation:

The only time Requests will not do this is if no explicit charset is present in the HTTP headers and the Content-Type header contains text. In this situation, RFC 2616 specifies that the default charset must be ISO-8859-1. Requests follows the specification in this case. If you require a different encoding, you can manually set the Response.encoding property, or use the raw Response.content.

Bold emphasis mine.
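In practice that means you can override the guess yourself before reading resp.text; a minimal sketch, assuming you already know (as this page declares in its <meta> tag) that the body is UTF-8:

resp = requests.get(url, timeout=60, verify=False, allow_redirects=True)

# tell requests what the body is really encoded as; resp.text then decodes with it
resp.encoding = 'utf-8'
text = resp.text

# or skip the guess entirely and decode the raw bytes yourself
text = resp.content.decode('utf-8')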

You can test for this by looking for the charset parameter in the Content-Type header:

resp = requests.get(....)
encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
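For the page in question this check presumably comes back empty, since requests only falls back to ISO-8859-1 when the header carries no charset; roughly:

resp = requests.get(url, timeout=60, verify=False, allow_redirects=True)
print(resp.headers.get('content-type'))   # e.g. text/html, with no charset parameter
encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
print(encoding)                           # None -> fall back to the document itself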

Your HTML document specifies the content type in a <meta> header, and it is that header that is authoritative:

<meta http-equiv="content-type" content="text/html; charset=utf-8" /> 

HTML 5 also defines a <meta charset="..." /> tag; see <meta charset="utf-8"> vs <meta http-equiv="content-type">.

You should not recode HTML pages to UTF-8 if they contain such a header with a different codec. You must at least correct that header in that case.

Using BeautifulSoup:

from bs4 import BeautifulSoup

# pass in an explicit encoding if it was set as a header
encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
content = resp.content
soup = BeautifulSoup(content, "html.parser", from_encoding=encoding)
if soup.original_encoding != 'utf-8':
    meta = soup.select_one('meta[charset], meta[http-equiv="content-type"]')
    if meta:
        # replace the meta charset info before re-encoding
        if 'charset' in meta.attrs:
            meta['charset'] = 'utf-8'
        else:
            meta['content'] = 'text/html; charset=utf-8'
    # re-encode to UTF-8
    content = soup.prettify("utf-8")  # prettify(encoding) returns UTF-8-encoded bytes

Similarly, other document standards may specify their own encodings; XML, for example, is UTF-8 unless otherwise specified by an <?xml encoding="..." ... ?> XML declaration, which is again part of the document itself.
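As a rough illustration of the XML case (standard library only; the document below is made up), a parser is expected to honour the declared encoding when given raw bytes:

import xml.etree.ElementTree as ET

# a made-up document that declares its own, non-UTF-8 encoding
xml_bytes = '<?xml version="1.0" encoding="ISO-8859-1"?><root>á</root>'.encode("iso-8859-1")

# ElementTree reads the declaration from the byte stream and decodes accordingly
root = ET.fromstring(xml_bytes)
print(root.text)   # á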

