If a zip file contains Japanese filenames, the encoding is usually SHIFT_JIS, especially when the archive was created on Windows. The normal “unzip” will not extract such files correctly; 7z is a good solution.
The following commands are run in a terminal. First, we need to change the LANG environment variable, because the default LANG is normally UTF-8. Since the filenames are SHIFT_JIS, not UTF-8, we need to change it.
LANG=ja_JP # Don't use UTF-8, use "export" if needed
Then,
7z x jp.zip #extract the files and preserve the encoding
As a result, a list of unreadable files is extracted. Then use the convmv command to convert the filenames, assuming all the files are in the same folder.
convmv --notest -f shift-jis -t utf8 *.* # convert all the filenames to UTF-8
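If convmv is not installed, a rough single-level equivalent can be improvised with iconv and mv. This is only a sketch, assuming GNU iconv; it does not recurse into subdirectories, and any name iconv cannot convert is skipped:

```shell
# Sketch: single-level convmv substitute using iconv + mv
# (assumptions: GNU iconv and mv; does not recurse)
for f in *; do
  new=$(printf '%s' "$f" | iconv -f SHIFT-JIS -t UTF-8 2>/dev/null) || continue
  if [ -n "$new" ] && [ "$f" != "$new" ]; then
    mv -- "$f" "$new"
  fi
done
```

Pure-ASCII names convert to themselves and are left untouched by the `"$f" != "$new"` guard.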
If we don’t know the character encoding of the filenames, we can also use iconv to check the encoding before extracting them,
env LANG=ja_JP 7z l jp.zip | iconv -f SHIFT-JIS -t UTF8
Update (2011-05-16):
The easiest way is using 7-zip through Wine and installing all the required fonts through winetricks.
Please fix your article:
“7z a jp.zip” doesn’t extract. Use the command “7z e” to extract.
Thanks, I didn’t notice that.
“7z e” extracts without preserving paths, “7z x” extracts while preserving paths
Similar thing with convmv: you need the “-r” switch for it to work recursively through all directories. Using “*” instead of “*.*” is also a good idea; unlike on Windows, “*.*” doesn’t match files without an extension…
Thank you. I used that command because the file I was extracting did not have any path within.
Thank you very much for this! Broken encoding was driving me totally insane!
unzip -O CP932 japanese_sjis.zip
This is Arch only. It can be compiled for other Linux distros, but Info-ZIP is a real pain in the ass (I did it, though). It’s the best option here.
Allemcch, thank you so much for posting this. I needed some Shift-JIS-encoded documents for work, and with your help I was able to read the filenames properly.
it works like a charm!
Though this solution is excellent (including the one in the comments with the sneaky -O option in unzip, neat!), I guess it will only work if you have this locale installed and/or generated with locale.gen.
Here, I have a Lubuntu installation which, for instance, does not even provide an ISO-8859-15 encoding by default: everything is UTF-8 only. Wait, how do I know that?
By typing
locale -a
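Following up on that: if the Japanese locale is missing, it can usually be generated before retrying the LANG trick. This is a sketch for Debian/Ubuntu-style systems; file paths and tooling differ on other distros:

```shell
# See whether any ja_JP locale is already installed
# ("C" and "POSIX" are always present in the full list)
locale -a | grep -i '^ja_JP' || echo "no ja_JP locale found"
# Debian/Ubuntu: uncomment the ja_JP lines in /etc/locale.gen, then run:
#   sudo locale-gen
```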
Hi Allen, could you please give me the full command line using this statement?
env LANG=ja_JP 7z l jp.zip | iconv -f SHIFT-JIS -t UTF8
You mention in your article that you can detect the encoding prior to extracting the file using 7-zip. I tried using “7za env LANG=ja_JP 7z l [file name].zip | iconv -f SHIFT-JIS -t UTF8”, but that didn’t work. Thanks!
Hi George, that is the full command.
env LANG=ja_JP 7z l jp.zip | iconv -f SHIFT-JIS -t UTF8
It is not “detect” but “check”. That means we need prior knowledge of the filenames’ encoding before we check it using iconv. If your filenames are GBK instead of SHIFT-JIS, then you need to use iconv -f GBK.
Please also read another article: https://allencch.wordpress.com/2013/04/15/extracting-files-from-zip-which-contains-non-utf8-filename-in-linux/
Hey, have you checked out The Unarchiver? It’s the only utility I had zero problems with when extracting Japanese archives. Just works.
I did the same as you, nekku:
sudo apt-get install unar
unar file.zip
and all the non-standard characters were perfectly readable!
Many thanks! At last, a very simple utility that *just works*, as you say – I used it on Windows 10, and avoided all the ‘default language for non-Unicode apps’ or apploc rigmarole.
Thanks, worked fine. I used the -r flag for convmv to easily traverse all subdirectories of my extracted files.
Thanks for this post! I had a couple of (seemingly) broken ZIP
archives that I had almost given up on before I found this.
Though, I’ve run into another possible problem that seems to be caused
by the different path separator in Windows vs in the ZIP standard:
Windows uses the byte 0x5c (“\”), whereas ZIP uses the ordinary 0x2f
(“/”) like you’d see in other non-Windows file systems.
It seems like some compression programs naively convert between the
two separators by replacing every “\” in the filename with “/”. This
might seem like a good idea if you live in a purely ASCII world, but
it actually causes problems when compressing Shift-JIS encoded
filenames.
The problem is that the byte 0x5c occurs in some multi-byte characters
in Shift-JIS. For example, ソ (Katakana letter “so”) is represented as
0x83 0x5c, so a faulty compressor program would replace this with 0x83
0x2f, which is not a valid Shift-JIS byte sequence.
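This is easy to verify from the shell, assuming iconv and od are available:

```shell
# Show the Shift-JIS bytes of ソ; the second byte is 0x5c (ASCII "\")
printf 'ソ' | iconv -f UTF-8 -t SHIFT-JIS | od -An -tx1
# → 83 5c
```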
Even worse, attempting to extract an archive with such a file name
does not just give it a malformed name, but something more bizarre.
Suppose we start with the following file: ドレミハソラチド.png
The program misinterprets it as containing a path separator and saves it
as [ド] [レ] [ミ] [ハ] 0x83 [/] [ラ] [チ] [ド] [.] [p] [n] [g]
(where [c] is the (correct) Shift-JIS representation of character c).
Now, since it contains a valid ZIP path separator, extracting it will
create the following:
a) a directory called ドレミハ� (where � stands for the stray byte 0x83,
which doesn’t represent any character on its own and is most likely
displayed as an empty box)
b) a file called ラチド.png inside the above directory
Since the actual file name has been broken up into several files (and
the byte 0x5c removed), convmv cannot fix it properly. In order to do
that, you have to first turn the directory/file names back into a
single file name by reversing the above steps.
I couldn’t figure out how to do this by simply renaming the files
(since my shell escaped every character into a pure clusterfuck that I
couldn’t make heads or tails of, and mv simply refused to accept
whatever file names I gave it) so I broke out a hex editor, patched
every instance of 0x5c mistakenly converted to 0x2f in the file names,
extracted it like normal and followed the steps in your post to
convert them into UTF-8. Worked a treat!
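The hex-editor patch can also be scripted. Below is a rough byte-aligned sketch using xxd and awk (both assumed to be installed); it only handles the 0x83 lead byte (i.e. ソ), so other lead bytes from the follow-up list of affected characters would need the same treatment, and broken.zip / fixed.zip are placeholder names. Warning: it blindly rewrites every 0x83 0x2f pair, including any that happen to sit inside compressed data, so always verify the result; a hex editor, as above, is safer.

```shell
# Dump one byte per line, fix 0x2f back to 0x5c whenever the previous
# byte was 0x83, then reassemble the archive
xxd -p -c1 broken.zip \
  | awk 'NR>1 { if (prev=="83" && $0=="2f") $0="5c"; print prev } { prev=$0 } END { print prev }' \
  | xxd -r -p > fixed.zip
```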
I also put together a list of every character affected by the 0x5c => 0x2f conversion:
|------------+-----------|
| Code point | Character |
|------------+-----------|
| 81 5C | ― |
| 83 5C | ソ |
| 84 5C | Ы |
| 89 5C | 噂 |
| 8A 5C | 浬 |
| 8B 5C | 欺 |
| 8C 5C | 圭 |
| 8D 5C | 構 |
| 8E 5C | 蚕 |
| 8F 5C | 十 |
| 90 5C | 申 |
| 91 5C | 曾 |
| 92 5C | 箪 |
| 93 5C | 貼 |
| 94 5C | 能 |
| 95 5C | 表 |
| 96 5C | 暴 |
| 97 5C | 予 |
| 98 5C | 禄 |
| 99 5C | 兔 |
| 9A 5C | 喀 |
| 9B 5C | 媾 |
| 9C 5C | 彌 |
| 9D 5C | 拿 |
| 9E 5C | 杤 |
| 9F 5C | 歃 |
| E0 5C | 濬 |
| E1 5C | 畚 |
| E2 5C | 秉 |
| E3 5C | 綵 |
| E4 5C | 臀 |
| E5 5C | 藹 |
| E6 5C | 觸 |
| E7 5C | 軆 |
| E8 5C | 鐔 |
| E9 5C | 饅 |
| EA 5C | 鷭 |
|------------+-----------|
All of these will give you the same problem (outlined in my previous post).
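Building on that list: before handing files to a Windows-bound zip toolchain, one can flag the risky names, i.e. those whose Shift-JIS encoding contains a 0x5c byte at all. A sketch assuming GNU od and grep, iconv, and UTF-8 filenames:

```shell
# Flag names whose Shift-JIS encoding contains a 0x5c byte
for f in *; do
  if printf '%s' "$f" | iconv -f UTF-8 -t SHIFT-JIS 2>/dev/null \
       | od -An -v -tx1 | grep -qw 5c; then
    echo "contains 0x5c: $f"
  fi
done
```

Note this also flags names containing a literal backslash, since that is the same byte.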
I have a list of files and I want to write a script to extract all of them.
My issue is that when I do LANG=ja.jp
find . -iname "*.zip" -print
it yields NOTHING.
any way to iterate through a list of zip files without using find?
I don’t understand why you get NOTHING with the command
find . -iname "*.zip" -print
because this command is not related to LANG.
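To answer the iteration question: a plain shell glob avoids find entirely. A sketch, with the actual extraction step commented out; 7z’s -o switch (output directory, no space before the path) is standard:

```shell
# Loop over every zip in the current directory without find
for f in *.zip; do
  [ -e "$f" ] || continue               # skip if the glob matched nothing
  echo "extracting: $f"
  # LANG=ja_JP 7z x "$f" -o"${f%.zip}"  # real extraction step
done
```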