Reading a Unicode Excel file in PHP

It's easy to save an Excel file as CSV and read it in PHP with the fgetcsv function, but this may not work well if the file contains non-English characters.

Excel does not write CSV files in a Unicode encoding; it typically uses a legacy, system-dependent code page.

You can save an Excel file as 'Unicode text' instead. However, there are several Unicode encodings: Windows uses UTF-16, whereas PHP code typically works with UTF-8.
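
To make the difference concrete, here is a minimal sketch that converts one character from UTF-16 to UTF-8 and prints the bytes of each form. It assumes the mbstring extension is loaded and that the input is little-endian UTF-16 (UTF-16LE); the variable names are only illustrative.

```php
<?php
// "é" is U+00E9. In UTF-16LE it is stored as the two bytes E9 00.
$utf16le = "\xE9\x00";

// mb_convert_encoding(string, to_encoding, from_encoding)
$utf8 = mb_convert_encoding($utf16le, 'UTF-8', 'UTF-16LE');

echo bin2hex($utf16le), "\n"; // e900
echo bin2hex($utf8), "\n";    // c3a9
```

The same character is a different byte sequence in each encoding, which is why the file has to be converted before the usual string functions see it.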

To open the 'Unicode text' file in PHP you have to convert it. In addition, you may want to open UTF-8 files created by other systems.

PHP has an encoding detection function, mb_detect_encoding, but it can't detect UTF-16.
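
Since detection does not help with UTF-16, one option is to look for a byte order mark (BOM) in the first two bytes. The sketch below shows that check on its own. The helper name utf16_variant_from_bom is hypothetical, and it assumes PHP 7.1 or later for the nullable return type.

```php
<?php
// Hypothetical helper: return the UTF-16 variant implied by a leading
// byte order mark, or null when no UTF-16 BOM is present.
function utf16_variant_from_bom(string $bytes): ?string
{
    $bom = substr($bytes, 0, 2);
    if ($bom === "\xFF\xFE") {
        return 'UTF-16LE'; // little-endian
    }
    if ($bom === "\xFE\xFF") {
        return 'UTF-16BE'; // big-endian
    }
    return null;
}

var_dump(utf16_variant_from_bom("\xFF\xFEa\x00")); // string(8) "UTF-16LE"
var_dump(utf16_variant_from_bom("plain ascii"));   // NULL
```

A file without a BOM falls through to null, at which point a detection function can be tried on a sample of the content.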

I've solved the problem with the following function, which detects the encoding from several candidates, appends an appropriate conversion filter, and returns a file handle that reads as UTF-8.

<?php
function fopen_utf8($filename){
    $encoding = '';
    $handle = fopen($filename, 'r');
    if ($handle === false) {
        return false; // the file could not be opened
    }

    // Peek at the first two bytes, then go back to the start of the file
    $bom = fread($handle, 2);
    rewind($handle);

    if ($bom === chr(0xff).chr(0xfe) || $bom === chr(0xfe).chr(0xff)) {
        // UTF-16 Byte Order Mark present
        $encoding = 'UTF-16';
    } else {
        // Read the first 1000 bytes as a sample;
        // the appended 'e' is a workaround for an mbstring bug
        $file_sample = fread($handle, 1000) . 'e';
        rewind($handle);

        $encoding = mb_detect_encoding($file_sample, 'UTF-8, UTF-7, ASCII, EUC-JP,SJIS, eucJP-win, SJIS-win, JIS, ISO-2022-JP');
    }
    if ($encoding) {
        // Convert to UTF-8 on the fly as the file is read
        stream_filter_append($handle, 'convert.iconv.'.$encoding.'/UTF-8');
    }
    return $handle;
}
?>
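
As a rough illustration of how such a handle can be used, the sketch below writes a small UTF-16 sample file, attaches the same kind of convert.iconv filter, and reads the rows with fgetcsv. It rests on several assumptions: the sample is tab-separated and starts with a BOM (which is how I would expect a 'Unicode text' export to look), this script itself is saved as UTF-8, and the mbstring and iconv extensions are available. The sample data and variable names are made up for the example.

```php
<?php
// Build a small UTF-16LE, tab-separated sample with a BOM so the example
// does not depend on a real Excel export.
$path = tempnam(sys_get_temp_dir(), 'u16');
$text = "name\tcity\nZoë\tMálaga\n";
file_put_contents($path, "\xFF\xFE" . mb_convert_encoding($text, 'UTF-16LE', 'UTF-8'));

$handle = fopen($path, 'r');
// Decode UTF-16 to UTF-8 as the stream is read; the BOM tells iconv the byte order.
stream_filter_append($handle, 'convert.iconv.UTF-16/UTF-8', STREAM_FILTER_READ);

$rows = [];
// Tab delimiter, double-quote enclosure, backslash escape (all passed explicitly).
while (($row = fgetcsv($handle, 0, "\t", '"', '\\')) !== false) {
    $rows[] = $row;
}
fclose($handle);
unlink($path);

print_r($rows);
```

With a real file, the fopen and stream_filter_append pair would be replaced by a single call to fopen_utf8, and the rest of the loop stays the same.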

Comments

Thanks

I was trying to figure this one out importing csv and tab files from different systems and of course different encodings.

Thank you.

Thanks

I have been trying to import Unicode csv files from Excel for ages. This function was a great help. Thanks

Excellent

That's really a good idea, added to my function collection. Really helps.

What is $line doing in

What is $line doing in $encoding = mb_detect_encoding($line , 'UTF-8, UTF-7, ASCII, EUC-JP,SJIS, eucJP-win, SJIS-win, JIS, ISO-2022-JP'); ?

correction

I'd put $line where it should have been $file_sample.

Now corrected - thanks

Thank you so much

Thank you so much. I used your function together with fgets() to read Unicode files from a Windows system, and it worked without any problems.

great function

Thanks a lot!

Thanks!

Top stuff mate - great effort and works like a treat. Thanks for sharing your knowledge! Jug on me if you ever make it to Fremantle, WA
