I am reading a PDF file with the iText library, version 4.1.6, and it works fine. But when I read PDFs that were created by a PDF print driver (with the print functionality in MS Word), I can't convert the ASCII chars correctly. The PDF operators are converted correctly, for example the tokens BT (begin text), ET (end text), etc. But when it comes to the text objects that are stored in a PDF array (an array as defined in the PDF ISO standard, not an array in the C# code!), the single chars have strange values. E.g. I have an 'R', and in bytes it has the value '1'. In the ASCII table, R is '82' (dec); the value 1 is the 'SOH' char. Other libraries can somehow convert this. Can anyone please tell me how I can convert this single byte to its letter 'R'? I have searched for hours and nothing has worked so far.
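To make the symptom concrete, here is a minimal sketch that is separate from my parser. The class name `SymptomDemo` and the method `DecodeAsAscii` are made up for illustration; the only real API used is `System.Text.Encoding.ASCII`. It shows that decoding the byte I find in the stream as ASCII gives the SOH control character, not 'R'.

```csharp
using System;
using System.Text;

// Hypothetical helper class, for illustration only.
public static class SymptomDemo
{
    // Decodes a single content-stream byte as plain ASCII,
    // which is effectively what my parser does with (char)input[i].
    public static char DecodeAsAscii(byte code)
    {
        return Encoding.ASCII.GetString(new[] { code })[0];
    }

    public static void Main()
    {
        byte codeInStream = 1;                       // value found where the 'R' should be
        char decoded = DecodeAsAscii(codeInStream);

        Console.WriteLine((int)decoded);             // prints 1 (SOH control character)
        Console.WriteLine((int)'R');                 // prints 82 (ASCII code of 'R')
    }
}
```

So a plain byte-to-char cast or an ASCII decode cannot produce the letter; some other mapping from the stored value to the character seems to be needed.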
Here is my current code for reading the PDF files (iText v. 4.1.6):

    public string ExtractPureText(string fileName)
    {
        StringBuilder sb = new StringBuilder();

        // Create a reader for the given PDF file
        PdfReader reader = new PdfReader(fileName);
        int totalLen = 68;
        float charUnit = ((float)totalLen) / (float)reader.NumberOfPages;

        for (int page = 1; page <= reader.NumberOfPages; page++)
        {
            sb.AppendLine(ExtractPureTextFromPdfBytes(reader.GetPageContent(page), page) + " ");
        }
        return sb.ToString();
    }
And here is the ExtractPureTextFromPdfBytes function:

    // CheckToken and _numberOfCharsToKeep are defined elsewhere in my class.
    public string ExtractPureTextFromPdfBytes(byte[] input, int pageNumber)
    {
        if (input == null || input.Length == 0) return "";

        int readPosition = 0;
        Encoding enc = new UnicodeEncoding(true, false);
        try
        {
            string resultString = "";

            // Flag showing if we are inside a text object
            bool inTextObject = false;

            // Flag showing if the next character is a literal,
            // e.g. '\\' to get a '\' character or '\(' to get '('
            bool nextLiteral = false;

            // () bracket nesting level. Text appears inside ()
            int bracketDepth = 0;

            // Keep the previous chars to extract numbers etc.:
            char[] previousCharacters = new char[_numberOfCharsToKeep];
            for (int j = 0; j < _numberOfCharsToKeep; j++) previousCharacters[j] = ' ';

            for (readPosition = 0; readPosition < input.Length; readPosition++)
            {
                char c = (char)input[readPosition];
                if (input[readPosition] == 213)
                    c = "'".ToCharArray()[0];

                if (inTextObject)
                {
                    // Debugging experiments on the byte at a hardcoded offset (116)
                    byte[] b = new byte[2];
                    b[0] = 0;
                    b[1] = input[116];
                    byte[] d = new byte[1];
                    d[0] = input[116];
                    string bString = System.Text.Encoding.ASCII.GetString(b);
                    if (readPosition >= 114)
                    {
                        string t = new string((char)(input[116] & 0xff), 1);
                    }

                    // Position the text
                    if (bracketDepth == 0)
                    {
                        if (CheckToken(new string[] { "TD", "Td", "'", "T*", "\"", "Tj", "TJ", "Tf" }, previousCharacters))
                        {
                            resultString += System.Environment.NewLine;
                        }
                    }

                    // End of a text object, also go to a new line.
                    if (bracketDepth == 0 && CheckToken(new string[] { "ET" }, previousCharacters))
                    {
                        resultString += System.Environment.NewLine;
                        inTextObject = false;
                    }
                    else
                    {
                        // Start outputting text
                        if ((c == '(') && (bracketDepth == 0) && (!nextLiteral))
                        {
                            bracketDepth = 1;
                        }
                        else
                        {
                            // Stop outputting text
                            if ((c == ')') && (bracketDepth == 1) && (!nextLiteral))
                            {
                                bracketDepth = 0;
                            }
                            else
                            {
                                // Just a normal text character:
                                if (bracketDepth == 1)
                                {
                                    // Only print out the next character no matter what.
                                    // Do not interpret.
                                    if (c == '\\' && !nextLiteral)
                                    {
                                        //resultString += c.ToString();
                                        nextLiteral = true;
                                    }
                                    else
                                    {
                                        if (((c >= ' ') && (c <= '~')) || ((c >= 128) && (c < 255)))
                                        {
                                            //resultString += c.ToString();
                                        }
                                        nextLiteral = false;
                                    }
                                }
                            }
                        }
                    }
                }

                // Note: every character is appended, inside or outside a text object
                resultString += c.ToString();

                // Store the recent characters for
                // when we have to go back for a checking
                for (int j = 0; j < _numberOfCharsToKeep - 1; j++)
                {
                    previousCharacters[j] = previousCharacters[j + 1];
                }
                previousCharacters[_numberOfCharsToKeep - 1] = c;

                // Start of a text object
                if (!inTextObject && CheckToken(new string[] { "BT" }, previousCharacters))
                {
                    inTextObject = true;
                    resultString += System.Environment.NewLine;
                    resultString += pageNumber.ToString() + " pn" + System.Environment.NewLine;
                }
            }

            string output = string.Empty;

            // Clean the text, remove empty lines and trim the lines
            using (StringReader reader = new StringReader(resultString))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    line = line.Trim();
                    if (line != string.Empty)
                    {
                        output += line + System.Environment.NewLine;
                    }
                }
            }
            return output;
        }
        catch
        {
            return "";
        }
    }
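For me it is absolutely not an option to use a higher version of iText because of licensing. If there is a library with a developer license, that could work; with iText, the license has to be paid for every machine on which our own software is installed. Unfortunately that is not an option for me. Thanks for any help.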