Extracting files from a zip that contains non-UTF-8 filenames in Linux


Previously, I wrote a post about extracting a zip file whose filenames are encoded in SHIFT_JIS.

However, that method does not work when the filenames are encoded in GBK (simplified Chinese). So I looked for a general solution for non-UTF-8 encodings, and found one.

The method is almost the same, but more generic.

First, the problem we face is that after extracting the files, the filenames are unreadable. Not only that, we cannot convert the filenames even with “convmv”, “iconv”, or “uconv”. This is normally caused by our OS locale setting. So that our OS (Linux) can display almost any language (East Asian languages, right-to-left languages, etc.), it normally uses a UTF-8 locale. It may be en_US.UTF-8, ja_JP.UTF-8, zh_CN.UTF-8, zh_TW.UTF-8, en_GB.UTF-8, etc.

Extracting non-UTF-8 filenames from a zip in a UTF-8 environment makes the filenames irrecoverable. This is because the non-UTF-8 bytes are written as if they were UTF-8, without any proper conversion. That is why converting filenames with this mojibake always fails, whether we are in a UTF-8 or a non-UTF-8 environment: the correct encoding was not used when the files were extracted.
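To make the encoding mismatch concrete, here is a small sketch that uses `iconv` on the raw bytes of a name instead of a real archive. The sample bytes are the GBK encoding of 中文, chosen purely for illustration. On their own they are not valid UTF-8, and they become readable only when GBK is named as the source encoding.

```shell
# Raw GBK bytes for the two characters 中文 (illustrative sample, written in octal).
raw=$(printf '\326\320\316\304')

# Read as UTF-8, the bytes are invalid, so iconv reports an error.
if printf '%s' "$raw" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
    valid_utf8=yes
else
    valid_utf8=no
fi

# Read as GBK, the same bytes convert cleanly to UTF-8.
decoded=$(printf '%s' "$raw" | iconv -f GBK -t UTF-8)

echo "valid as UTF-8: $valid_utf8"
echo "decoded from GBK: $decoded"
```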

But (as far as I know) no extraction tool allows converting the filenames during extraction. So we need to preserve the original encoding of the filenames when writing them in our UTF-8 environment.

In my old post, I used a non-UTF-8 locale, ja_JP, for extracting SHIFT_JIS filenames. A more generic way is to extract the files with LANG=C, that is, the C (POSIX) locale, which does not use UTF-8 encoding.

env LANG=C 7z x file.zip

As a result, you will see that the filenames contain a lot of question marks.
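The question marks are only how the terminal displays bytes it cannot render. Assuming the extraction preserved the raw bytes, as this method relies on, the name stored on disk still holds the original encoding. The sketch below simulates this with a scratch file whose name is a hypothetical GBK byte sequence (no archive is extracted) and dumps the stored name with `od`.

```shell
# Create a scratch directory holding one file with a raw GBK name
# (the bytes for 中文 plus ".txt"; a stand-in for an extracted file).
workdir=$(mktemp -d)
touch "$workdir/$(printf '\326\320\316\304.txt')"

# Dump the bytes of the stored name as hex, independent of terminal rendering.
for f in "$workdir"/*; do
    stored_hex=$(printf '%s' "${f##*/}" | od -An -tx1 | tr -d ' \n')
    echo "$stored_hex"
done

rm -r "$workdir"
```

If the hex dump matches the byte sequence of the expected encoding, the name is still recoverable.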

Now we can convert the filenames from their real encoding.

convmv -f gbk -t utf8 --notest -r * # for filenames encoded in GBK
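For a sense of what the conversion does to each name, the rename can be approximated with `iconv` and `mv`. This is only a sketch of the idea, not how convmv is implemented. The function name `rename_from_encoding` is hypothetical, it handles a single directory level rather than recursing, and it skips any name that fails to convert. Note also that running convmv without `--notest` only prints the planned renames, which is a safe way to check the chosen encoding first.

```shell
# Hypothetical helper: rename every entry in directory $1 from encoding $2 to UTF-8.
rename_from_encoding() {
    dir=$1
    enc=$2
    for path in "$dir"/*; do
        [ -e "$path" ] || continue
        old=${path##*/}
        # Convert the name; skip the entry if its bytes are not valid in $enc.
        new=$(printf '%s' "$old" | iconv -f "$enc" -t UTF-8 2>/dev/null) || continue
        # Nothing to do for names that are unchanged (for example, plain ASCII).
        [ "$new" = "$old" ] && continue
        # Do not overwrite an existing entry.
        [ -e "$dir/$new" ] && continue
        mv -- "$path" "$dir/$new"
    done
}

# Usage on a scratch directory with one simulated GBK-named file.
demo=$(mktemp -d)
touch "$demo/$(printf '\326\320\316\304.txt')"
rename_from_encoding "$demo" GBK
result=$(ls "$demo")
echo "$result"
rm -r "$demo"
```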

About Allen Choong

A cognitive science student, a programmer, a philosopher, a Catholic.

4 responses »

  1. Thank you for your post, Allen! For the past several years, I have had to load a Windows VM every time I wanted to unzip an archive containing CJK characters (in UTF-16 LE format) created in a Windows environment.

    Using your tutorial, however, I was able to convert broken Korean file names into UTF-8 using the following:

    convmv -f euc-kr -t utf8 --notest -r /path/to/extracted/folder

  2. There is an unzip alternative, “unzip-rcc” (at least on openSUSE), which can handle other encodings (although you still have to manually convert wrong characters after extraction).

  3. Thank you very much, Allen. It saved my day. Although... the contents of the files are in Chinese as well. But Google Translate sheds some light.
