C# reading PDF with ASCII chars

I am reading a PDF file with the iText library, version 4.1.6, and it works fine. But when I read PDFs that were created by a PDF print driver (with the print functionality in MS Word), there are ASCII chars that I can't convert correctly. Some of the PDF content is converted correctly, for example the tokens BT (begin text), ET (end text), etc. But when it comes to the text objects stored in a PDF array (an array in the sense of the PDF ISO standard, not in C# code!), the single chars have strange values. E.g. I have an 'R', and in bytes it has the value '1'. In the ASCII table, R is '82' (dec). The value 1 is the 'SOH' char. Other libraries can somehow convert this. Can anyone please tell me how I can convert this single byte to its letter 'R'? I have searched for hours and nothing has worked so far.
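To make the symptom concrete, here is a minimal sketch of what seems to be happening. It rests on an assumption that nothing above confirms: that the print driver embedded a font with its own encoding, so the byte 1 is a font-specific character code rather than an ASCII value. The class `CharCodeDemo`, the method `Decode`, and the one-entry map are hypothetical; in a real file the mapping would have to come from the font's encoding data in the PDF, not from a hard-coded table.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class CharCodeDemo
{
    // Hypothetical helper: translates font-specific character codes into
    // characters using a lookup table supplied by the caller.
    public static string Decode(byte[] codes, IDictionary<byte, char> codeToUnicode)
    {
        var sb = new StringBuilder();
        foreach (byte code in codes)
        {
            char ch;
            // Unknown codes become U+FFFD (replacement character).
            sb.Append(codeToUnicode.TryGetValue(code, out ch) ? ch : '\uFFFD');
        }
        return sb.ToString();
    }

    public static void Main()
    {
        byte[] raw = { 1 };

        // Treating the byte as ASCII yields the control character SOH (U+0001).
        string ascii = Encoding.ASCII.GetString(raw);
        Console.WriteLine((int)ascii[0]);        // prints 1

        // Assumed mapping for illustration only: code 1 stands for 'R'.
        var assumedMap = new Dictionary<byte, char> { { 1, 'R' } };
        Console.WriteLine(Decode(raw, assumedMap)); // prints R
    }
}
```

If this assumption holds, no cast or `Encoding` class can recover the letter from the byte alone, because the information that code 1 means 'R' lives in the font description, not in the content stream.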

Here is my recent code showing how I read the PDF files (iText v. 4.1.6):

    using System.Text;
    using iTextSharp.text.pdf;

    public string ExtractPureText(string filename)
    {
        StringBuilder sb = new StringBuilder();

        // Create a reader for the given PDF file
        PdfReader reader = new PdfReader(filename);

        // Note: totalLen and charUnit are not used further below
        int totalLen = 68;
        float charUnit = ((float)totalLen) / (float)reader.NumberOfPages;

        // Pages are 1-based
        for (int page = 1; page <= reader.NumberOfPages; page++)
        {
            sb.AppendLine(ExtractPureTextFromPdfBytes(reader.GetPageContent(page), page) + " ");
        }

        return sb.ToString();
    }

Here is the ExtractPureTextFromPdfBytes function:

    // Needs: using System.IO; using System.Text;
    public string ExtractPureTextFromPdfBytes(byte[] input, int pageNumber)
    {
        if (input == null || input.Length == 0) return "";

        int readPosition = 0;
        Encoding enc = new UnicodeEncoding(true, false);

        try
        {
            string resultString = "";

            // Flag showing if we are inside a text object
            bool inTextObject = false;

            // Flag showing if the next character is a literal,
            // e.g. '\\' to get a '\' character or '\(' to get '('
            bool nextLiteral = false;

            // () bracket nesting level. Text appears inside ()
            int bracketDepth = 0;

            // Keep the previous chars to extract numbers etc.:
            char[] previousCharacters = new char[_numberOfCharsToKeep];
            for (int j = 0; j < _numberOfCharsToKeep; j++) previousCharacters[j] = ' ';

            for (readPosition = 0; readPosition < input.Length; readPosition++)
            {
                char c = (char)input[readPosition];
                if (input[readPosition] == 213)
                    c = "'".ToCharArray()[0];

                if (inTextObject)
                {
                    // Experiments with the byte at offset 116; the results are not used below
                    byte[] b = new byte[2];
                    b[0] = 0;
                    b[1] = input[116];

                    byte[] d = new byte[1];
                    d[0] = input[116];
                    string bString = System.Text.Encoding.ASCII.GetString(b);

                    if (readPosition >= 114)
                    {
                        string t = new string((char)(input[116] & 0xff), 1);
                    }

                    // Position the text
                    if (bracketDepth == 0)
                    {
                        if (CheckToken(new string[] { "TD", "Td", "'", "T*", "\"", "TJ", "Tj", "Tf" }, previousCharacters))
                        {
                            resultString += System.Environment.NewLine;
                        }
                    }

                    // End of a text object, also go to a new line.
                    if (bracketDepth == 0 && CheckToken(new string[] { "ET" }, previousCharacters))
                    {
                        resultString += System.Environment.NewLine;
                        inTextObject = false;
                    }
                    else
                    {
                        // Start outputting text
                        if ((c == '(') && (bracketDepth == 0) && (!nextLiteral))
                        {
                            bracketDepth = 1;
                        }
                        else
                        {
                            // Stop outputting text
                            if ((c == ')') && (bracketDepth == 1) && (!nextLiteral))
                            {
                                bracketDepth = 0;
                            }
                            else
                            {
                                // Just a normal text character:
                                if (bracketDepth == 1)
                                {
                                    // Only print out the next character no matter what.
                                    // Do not interpret.
                                    if (c == '\\' && !nextLiteral)
                                    {
                                        //resultString += c.ToString();
                                        nextLiteral = true;
                                    }
                                    else
                                    {
                                        if (((c >= ' ') && (c <= '~')) ||
                                            ((c >= 128) && (c < 255)))
                                        {
                                            //resultString += c.ToString();
                                        }

                                        nextLiteral = false;
                                    }
                                }
                            }
                        }
                    }
                }

                // Every byte is appended to the result, inside or outside a text object
                resultString += c.ToString();

                // Store the recent characters for
                // when we have to go back for checking
                for (int j = 0; j < _numberOfCharsToKeep - 1; j++)
                {
                    previousCharacters[j] = previousCharacters[j + 1];
                }
                previousCharacters[_numberOfCharsToKeep - 1] = c;

                // Start of a text object
                if (!inTextObject && CheckToken(new string[] { "BT" }, previousCharacters))
                {
                    inTextObject = true;
                    resultString += System.Environment.NewLine;
                    resultString += pageNumber.ToString() + " pn" + System.Environment.NewLine;
                }
            }

            string output = string.Empty;

            // Clean the text, remove empty lines and trim the lines
            using (StringReader reader = new StringReader(resultString))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    line = line.Trim();
                    if (line != string.Empty)
                    {
                        output += line + System.Environment.NewLine;
                    }
                }
            }

            return output;
        }
        catch
        {
            return "";
        }
    }
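The function above relies on a helper for token checks and on a buffer-size field, neither of which is shown in the post. So that the listing can be followed end to end, here is a minimal sketch of what such a helper might look like. Everything in it is an assumption: the buffer size of 15, the rule that a token counts only when it sits at the end of the buffer with a space, CR or LF on both sides, and the class name `TokenCheckSketch`. It is not the code actually used.

```csharp
using System;

public static class TokenCheckSketch
{
    // Assumed buffer size; the real value is not given in the post.
    public const int NumberOfCharsToKeep = 15;

    // Returns true if the buffer ends with: delimiter + token + delimiter.
    public static bool CheckToken(string[] tokens, char[] recent)
    {
        foreach (string token in tokens)
        {
            // Index where the token would start if it ends right before the last char.
            int start = recent.Length - token.Length - 1;
            if (start < 1) continue; // no room for the leading delimiter

            if (!IsDelimiter(recent[recent.Length - 1])) continue;
            if (!IsDelimiter(recent[start - 1])) continue;

            bool match = true;
            for (int i = 0; i < token.Length; i++)
            {
                if (recent[start + i] != token[i]) { match = false; break; }
            }
            if (match) return true;
        }
        return false;
    }

    // Space, carriage return and line feed are treated as token delimiters.
    private static bool IsDelimiter(char c)
    {
        return c == ' ' || c == '\r' || c == '\n';
    }
}
```

Under these assumptions, a buffer ending in " BT " matches the token "BT", while an operator that directly follows a closing bracket, as in "(abc)Tj ", is not detected because no delimiter precedes it.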

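For me it is absolutely not an option to move to a higher version of iText because of the licensing. If there were a library with a developer license, that would be fine, but the iText license would have to be paid for every machine on which my own software is installed. Unfortunately that is not an option for me. Thanks for any help.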
