Sunday, May 1, 2011

ASP.NET: converting an ISO-8859-1 file to UTF-8

Hi, I need to convert a CSV file from ISO-8859-1 to UTF-8 to keep the accents in the database.

French accents (é, è, ê, and the like) are not kept when I try to convert them to UTF-8; they are changed to "?".

I'm stumped.

I use the following function for the conversion:

public static string iso8859ToUnicode(string src) {
    Encoding iso = Encoding.GetEncoding("iso8859-1");
    Encoding unicode = Encoding.UTF8;
    byte[] isoBytes = iso.GetBytes(src);
    byte[] unibytes = Encoding.Convert(iso, unicode, isoBytes);
    char[] unichars = new char[iso.GetCharCount(unibytes, 0, unibytes.Length)];
    unicode.GetChars(unibytes, 0, unibytes.Length, unichars, 0);
    return new string(unichars);
}

But it doesn't seem to work well. Help?

From stackoverflow
  • You might be losing your encoding when you declare the new string, or when you store the data in the char array.

    MrZombie : I shouldn't be losing the encoding that way, as I'm converting the iso to bytes, then the bytes to utf-8... Unless there is byte-level automatic character conversion that I'm not aware of, it shouldn't be the problem.
  • Instead of the GetChars() method, can't you just call

    unicode.GetString(unibytes);
    
  • I strongly suspect that your original string doesn't have the correct values. My guess is that you've read it from the file as if it were UTF-8.

    To convert between two encodings, you shouldn't have the string in the first place - you should basically load the bytes of the file and call Encoding.Convert() that way. Alternatively, load the file using ISO-Latin-1 and just save it as UTF-8. For example:

    public static void ConvertLatin1ToUtf8(string inputFile, string outputFile)
    {
        Encoding latin1 = Encoding.GetEncoding(28591);
        string text = File.ReadAllText(inputFile, latin1);
        File.WriteAllText(outputFile, text, Encoding.UTF8);
    }
    

    or

    public static void ConvertLatin1ToUtf8(string inputFile, string outputFile)
    {
        Encoding latin1 = Encoding.GetEncoding(28591);
        byte[] latinBytes = File.ReadAllBytes(inputFile);
        byte[] utf8Bytes = Encoding.Convert(latin1, Encoding.UTF8, latinBytes);
        File.WriteAllBytes(outputFile, utf8Bytes);
    }
    
    MrZombie : Thank you a million times and a half. Is it okay for me to hate encoding issues? :P
    Jon Skeet : Only if I can hate time zone issues more :)
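
To see why the string is probably already damaged before iso8859ToUnicode() is called, here is a minimal sketch, assuming the file really is ISO-8859-1 and was read as if it were UTF-8. The EncodingDemo class, its Decode helper and the sample bytes are made up for illustration; only the Encoding calls come from the thread.

```csharp
using System;
using System.Text;

public static class EncodingDemo
{
    // "café" as it would be stored in an ISO-8859-1 file (é is the single byte 0xE9).
    public static readonly byte[] CafeLatin1 = { 0x63, 0x61, 0x66, 0xE9 };

    // Hypothetical helper: decode raw bytes with the given encoding.
    public static string Decode(byte[] bytes, Encoding encoding)
    {
        return encoding.GetString(bytes);
    }

    public static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding(28591);

        // Wrong: 0xE9 on its own is not valid UTF-8, so the decoder
        // substitutes U+FFFD and the accent is lost for good.
        string wrong = Decode(CafeLatin1, Encoding.UTF8);
        Console.WriteLine(wrong.Contains("\uFFFD"));   // True

        // Right: decode with the encoding the file was actually written in.
        string right = Decode(CafeLatin1, latin1);
        Console.WriteLine(right == "caf\u00E9");       // True

        // Converting the raw bytes directly also works; é becomes C3 A9.
        byte[] utf8 = Encoding.Convert(latin1, Encoding.UTF8, CafeLatin1);
        Console.WriteLine(BitConverter.ToString(utf8)); // 63-61-66-C3-A9
    }
}
```

Once the replacement character is in the string, no later conversion can bring the accent back, which would explain the "?" in the database. That is why the fix belongs at the point where the file is read.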
