Previously, I wrote a post about extracting zip files whose filenames are encoded in SHIFT_JIS.
However, that method does not work when the filenames are encoded in GBK (simplified Chinese). So I looked for a general solution that covers any non-UTF8 encoding.
The method is almost the same, just more generic.
First, the problem: after extracting the files, the filenames are unreadable. Worse, we cannot repair them even with “convmv”, “iconv”, or “uconv”. This is normally caused by the OS locale setting. So that Linux can display almost any language (East Asian languages, right-to-left languages, etc.), it normally runs with a UTF8 locale, such as en_US.UTF8, ja_JP.UTF8, zh_CN.UTF8, zh_TW.UTF8, or en_GB.UTF8.
When we extract non-UTF8 filenames from a zip in a UTF8 environment, the filenames become unrecoverable. This is because the non-UTF8 bytes are written as if they were UTF8, without any proper conversion. That is why converting these mojibake filenames always fails, whether we are in a UTF8 or a non-UTF8 environment: the correct encoding was not used when the files were extracted.
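As a rough illustration of why a later conversion fails, here is a small Python sketch. It assumes, purely for illustration, that the extractor misread the stored GBK bytes as cp437 before writing them out as UTF8; the actual wrong encoding depends on the tool and the locale, and the filename "中文.txt" is just a made-up example.

```python
# Original name and the bytes a GBK-based archiver would store in the zip.
original = "中文.txt"
stored = original.encode("gbk")

# Assumption: the extractor reads those bytes as cp437 and writes UTF-8.
misread = stored.decode("cp437")
on_disk = misread.encode("utf-8")

# The bytes on disk are no longer the GBK bytes from the archive,
# so a plain GBK -> UTF-8 conversion cannot give the original name back.
print(stored.hex())
print(on_disk.hex())

try:
    recovered = on_disk.decode("gbk")
except UnicodeDecodeError:
    recovered = None

print(recovered == original)  # False
```

The point is only that the bytes on disk differ from the bytes in the archive, so telling a conversion tool "this is GBK" no longer describes what is actually there.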
As far as I know, no extraction tool converts the filenames during extraction. So we need to preserve the original bytes of the filenames when they are written into our UTF8 environment.
In my old post, I extracted the SHIFT_JIS filenames under a non-UTF8 locale, ja_JP. A more generic way is to extract the files with LANG="C", that is, the standard C (POSIX) locale, which has no UTF8 encoding. If LC_ALL is set in your environment, it overrides LANG, so set LC_ALL=C instead.
env LANG=C 7z x file.zip
After this, you will see that the filenames contain a lot of question marks.
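The question marks are expected: raw GBK bytes are not valid UTF8, so a UTF8 terminal cannot display them. A minimal Python sketch of the same effect, assuming the extracted name still holds the untouched GBK bytes of the made-up filename "中文.txt" (Python uses the U+FFFD replacement character where a terminal would typically print "?"):

```python
# Assumed: the bytes left on disk after extracting with LANG=C.
raw_name = "中文.txt".encode("gbk")

# Reading the bytes as UTF-8 fails for every GBK byte, so each one is replaced.
shown = raw_name.decode("utf-8", errors="replace")
print(shown)

# The bytes themselves are intact, so decoding them as GBK still works.
print(raw_name.decode("gbk"))
```

The display is garbled, but nothing has been lost, which is what makes the conversion step possible.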
Now we can convert the filenames from their real encoding to UTF8.
convmv -f gbk -t utf8 --notest -r * # for filenames that are encoded in GBK
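If convmv is not available, the same idea can be sketched in Python. This is not how convmv is implemented, only a minimal approximation under a few assumptions: the function names `gbk_to_utf8` and `rename_tree` are hypothetical, the filesystem accepts arbitrary bytes in filenames (as typical Linux filesystems do), and any name that is already valid UTF8 is left alone, which is a heuristic rather than a guarantee.

```python
import os


def gbk_to_utf8(raw: bytes) -> bytes:
    """Re-encode one filename from GBK bytes to UTF-8 bytes."""
    return raw.decode("gbk").encode("utf-8")


def rename_tree(root: bytes, dry_run: bool = True) -> list:
    """Walk root bottom-up and rename GBK-encoded entries to UTF-8.

    Returns the list of (old_name, new_name) pairs. With dry_run=True
    nothing is renamed; the pairs are only collected for review.
    """
    planned = []
    # Bottom-up, so a directory is renamed only after its contents.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames + dirnames:
            try:
                name.decode("utf-8")
                continue  # already valid UTF-8 (including plain ASCII)
            except UnicodeDecodeError:
                pass
            new_name = gbk_to_utf8(name)  # raises if the name is not GBK
            planned.append((name, new_name))
            if not dry_run:
                os.rename(os.path.join(dirpath, name),
                          os.path.join(dirpath, new_name))
    return planned
```

Calling `rename_tree(b".")` first only lists what would change; `rename_tree(b".", dry_run=False)` performs the renames. Passing the root as bytes makes `os.walk` return the names as raw bytes instead of decoded strings.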