How to extract zip file which contains filenames with SHIFT_JIS encoding in Ubuntu

If a zip file contains Japanese filenames, they are normally encoded in SHIFT_JIS, especially when the archive was created on Windows. The normal “unzip” will not extract such files with readable names. 7z is a good solution.

The following commands are run in a terminal. First, we need to change the LANG environment variable, because the default LANG normally uses UTF-8. Since the filenames are SHIFT_JIS, not UTF-8, we need to switch to a non-UTF-8 locale.

LANG=ja_JP # Don't use UTF-8, use "export" if needed


7z x archive.zip    # extract the files and preserve the encoding (archive.zip is a placeholder name)

As a result, files with unreadable names are extracted. Then use the convmv command to convert the filenames, assuming all the files are in the same folder.

convmv --notest -f shift-jis -t utf8 *.* # convert all the filenames to UTF-8
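The two steps above can be wrapped in one small shell function. This is only a sketch under a few assumptions: the names `run`, `extract_sjis_zip` and `DRY_RUN` are made up for illustration, 7z and convmv must be installed, and the ja_JP locale must exist on the system. It uses 7z's -o switch to choose the output folder and convmv's -r switch to recurse into subfolders.

```shell
# run: execute a command, or only print it when DRY_RUN is set (hypothetical helper).
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# extract_sjis_zip ARCHIVE DEST (hypothetical name): extract, then fix the names.
extract_sjis_zip() {
    archive=$1
    dest=$2
    # Use a non-UTF-8 LANG for 7z, as the post does, before extracting into DEST.
    run env LANG=ja_JP 7z x -o"$dest" "$archive" || return 1
    # Convert the extracted names from Shift-JIS to UTF-8, including subfolders.
    run convmv -r --notest -f shift-jis -t utf8 "$dest"
}
```

Running `DRY_RUN=1 extract_sjis_zip archive.zip out` only prints the two commands, which is a cheap way to check them before touching any files.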

If we are not sure about the character encoding of the filenames, we can also use iconv to check a guessed encoding before extracting them:

env LANG=ja_JP 7z l archive.zip | iconv -f SHIFT-JIS -t UTF8 # archive.zip is a placeholder name
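When several encodings are plausible, the same check can be repeated for each guess. The helper below is a rough sketch, not a detector: `try_encodings` is a made-up name, the default candidate list is an assumption, and it only reports which encodings iconv accepts without an error. More than one encoding can pass, so the converted listing still has to be read by eye.

```shell
# try_encodings [ENCODING...]: read raw bytes on stdin and print every candidate
# encoding that iconv converts to UTF-8 without an error (hypothetical helper).
try_encodings() {
    # Default guesses; pass your own list as arguments to override them.
    [ "$#" -gt 0 ] || set -- SHIFT_JIS EUC-JP GBK UTF-8
    tmp=$(mktemp) || return 1
    cat > "$tmp"                      # keep the input so it can be re-read per guess
    for enc in "$@"; do
        if iconv -f "$enc" -t UTF-8 < "$tmp" > /dev/null 2>&1; then
            printf '%s\n' "$enc"
        fi
    done
    rm -f "$tmp"
}
```

A possible use is to pipe the 7z listing into it, for example `env LANG=ja_JP 7z l archive.zip | try_encodings`, and then view the listing with each encoding it prints.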

Update (2011-05-16):
The easiest way is to run 7-Zip through Wine and install all the required fonts through winetricks.

23 thoughts on “How to extract zip file which contains filenames with SHIFT_JIS encoding in Ubuntu”

  1. “7z e” extracts without preserving paths, “7z x” extracts while preserving paths

    Similar thing with convmv: you need the “-r” switch for it to work recursively through all directories. Using “*” instead of “*.*” is also a good idea; unlike on Windows, “*.*” doesn’t match files without an extension…

    1. this is arch only. It can be compiled for other linux distro’s, but info-zip is a real pain in the ass. (I did it though). It’s the best option here.
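The point in comment 1 about “*” versus “*.*” is easy to check in a scratch folder. The sketch below uses a made-up helper, `count_matches`, that counts how many entries of a folder a glob pattern matches.

```shell
# count_matches DIR PATTERN: count the entries in DIR matched by a shell glob
# (hypothetical helper; runs in a subshell so the cd does not leak out).
count_matches() (
    cd "$1" || exit 1
    set -- $2                 # unquoted on purpose so the shell expands the glob
    # If nothing matched, the pattern is left as-is and names no real file.
    [ -e "$1" ] || { echo 0; exit 0; }
    echo "$#"
)
```

In a folder holding a.txt, b.zip and README, `count_matches . '*.*'` reports 2 while `count_matches . '*'` reports 3, because README has no dot in its name.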

  2. Allemcch, thank you so much for posting this. I needed some Shift-JIS-encoded documents for work, and with your help I was able to read the filenames properly.

  3. Though this solution is excellent (including the ones in the comments with the sneaky -O option in unzip, neat!), I guess it will only work if you have this locale installed and/or created with locale.gen.

    Here, I have a Lubuntu installation which, for instance, does not even provide an ISO-8859-15 locale by default: everything is UTF-8 only. Wait, how do I know that?
    By typing

    locale -a
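The check from comment 3 can be scripted so that a missing locale is noticed before extracting. This is a small sketch; `has_locale` is a made-up helper name, and it simply looks for a name in the `locale -a` output that starts with the given prefix.

```shell
# has_locale PREFIX: succeed if any locale name read from stdin starts with PREFIX
# (hypothetical helper; feed it the output of `locale -a`).
has_locale() {
    grep -q -i "^$1"
}
```

For example, `locale -a | has_locale ja_JP || echo "no ja_JP locale installed"` prints the warning only when no Japanese locale is listed.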

  4. Hi Allen, Could you please give me the full command line using this statement?
    env LANG=ja_JP 7z l | iconv -f SHIFT-JIS -t UTF8. You mention in your article that you can detect the encoding prior to extracting the file using 7-zip. I tried using 7za env LANG=ja_JP 7z l [file name].zip | iconv -f SHIFT-JIS -t UTF8 but that didn’t work. Thanks!

    1. Hi George, that is the full command.

      env LANG=ja_JP 7z l archive.zip | iconv -f SHIFT-JIS -t UTF8

      It is not to “detect”, but to “check”. That means we need prior knowledge of the filename encoding before we check it with iconv. If your filenames are GBK instead of SHIFT-JIS, then you need to use iconv -f GBK.

      Please read another article also

    1. Many thanks! At last, a very simple utility that *just works*, as you say – I used it on Windows 10, and avoided all the ‘default language for non-Unicode apps’ or apploc rigmarole.

  5. Thanks, worked fine. I used -r flag for convmv to easily traverse all subdirectories of my extracted files.

  6. Thanks for this post! I had a couple of (seemingly) broken ZIP
    archives that I had almost given up on before I found this.

    Though, I’ve run into another possible problem that seems to be caused
    by the different path separator in Windows vs in the ZIP standard:
    Windows uses the byte 0x5c (“\”), whereas ZIP uses the ordinary 0x2f
    (“/”) like you’d see in other non-Windows file systems.

    It seems like some compression programs naively convert between the
    two separators by replacing every “\” in the filename with “/”. This
    might seem like a good idea if you live in a purely ASCII world, but
    it actually causes problems when compressing Shift-JIS encoded file names.

    The problem is that the byte 0x5c occurs in some multi-byte characters
    in Shift-JIS. For example, ソ (Katakana letter “so”) is represented as
    0x83 0x5c, so a faulty compressor program would replace this by 0x83
    0x2f, which is not a valid Shift-JIS byte sequence.

    Even worse, attempting to extract an archive with such a file name
    does not just give it a malformed name, but something more bizarre.
    Suppose we start with the following file: ドレミハソラチド.png
    The program misinterprets it as containing a path separator and saves it
    as [ド] [レ] [ミ] [ハ] 0x83 [/] [ラ] [チ] [ド] [.] [p] [n] [g]
    (where [c] is the (correct) Shift-JIS representation of character c).
    Now, since it contains a valid ZIP path separator, extracting it will
    create the following:
    a) a directory called ドレミハ□ (where □ is the byte 0x83,
    which doesn’t stand for any character on its own and is most likely
    displayed as an empty box)
    b) a file called ラチド.png inside the above directory

    Since the actual file name has been broken up into several files (and
    the byte 0x5c removed), convmv cannot fix it properly. In order to do
    that, you have to first turn the directory/file names back into a
    single file name by reversing the above steps.

    I couldn’t figure out how to do this by simply renaming the files
    (since my shell escaped every character into a pure clusterfuck that I
    couldn’t make heads or tails of, and mv simply refused to accept
    whatever file names I gave it) so I broke out a hex editor, patched
    every instance of 0x5c mistakenly converted to 0x2f in the file names,
    extracted it like normal and followed the steps in your post to
    convert them into UTF-8. Worked a treat!
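The byte-level effect described in comment 6 can be reproduced with iconv, without touching a real archive. The helpers below are a sketch with made-up names: `sjis_hex` shows the Shift-JIS bytes of a string, `mangle_sjis_name` imitates the faulty compressor, and `repair_sjis_name` reverses it. The repair works on a name passed as a byte string, which is a different technique from the hex-editor patch the commenter used, and it assumes the input is one file name with no genuine “/” separators.

```shell
# sjis_hex TEXT: print the Shift-JIS bytes of UTF-8 TEXT as hex (hypothetical helper).
sjis_hex() {
    printf '%s' "$1" | iconv -f UTF-8 -t SHIFT_JIS | od -An -tx1 |
        tr -s ' \n' ' ' | sed 's/^ //; s/ $//'
}

# mangle_sjis_name: imitate the faulty compressor by turning every 0x5c byte
# ("\") on stdin into 0x2f ("/").
mangle_sjis_name() {
    LC_ALL=C sed 's|\\|/|g'
}

# repair_sjis_name: turn every 0x2f byte on stdin back into 0x5c.
# Assumes stdin holds ONE file name with no genuine "/" separators in it.
repair_sjis_name() {
    LC_ALL=C sed 's|/|\\|g'
}
```

Here `sjis_hex 'ソ'` shows the two bytes 83 5c, and piping a Shift-JIS name through `mangle_sjis_name` and then `repair_sjis_name` gives back bytes that iconv can convert to UTF-8 again.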

  7. I also put together a list of every character affected by the 0x5c => 0x2f conversion:

    | Bytes (hex) | Character |
    | 81 5C | ― |
    | 83 5C | ソ |
    | 84 5C | Ы |
    | 89 5C | 噂 |
    | 8A 5C | 浬 |
    | 8B 5C | 欺 |
    | 8C 5C | 圭 |
    | 8D 5C | 構 |
    | 8E 5C | 蚕 |
    | 8F 5C | 十 |
    | 90 5C | 申 |
    | 91 5C | 曾 |
    | 92 5C | 箪 |
    | 93 5C | 貼 |
    | 94 5C | 能 |
    | 95 5C | 表 |
    | 96 5C | 暴 |
    | 97 5C | 予 |
    | 98 5C | 禄 |
    | 99 5C | 兔 |
    | 9A 5C | 喀 |
    | 9B 5C | 媾 |
    | 9C 5C | 彌 |
    | 9D 5C | 拿 |
    | 9E 5C | 杤 |
    | 9F 5C | 歃 |
    | E0 5C | 濬 |
    | E1 5C | 畚 |
    | E2 5C | 秉 |
    | E3 5C | 綵 |
    | E4 5C | 臀 |
    | E5 5C | 藹 |
    | E6 5C | 觸 |
    | E7 5C | 軆 |
    | E8 5C | 鐔 |
    | E9 5C | 饅 |
    | EA 5C | 鷭 |

    All of these will give you the same problem (outlined in my previous post).
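A list like the one in comment 7 can be regenerated by asking iconv to decode each possible lead byte followed by 0x5c. This is a sketch only: `list_5c_trail_bytes` is a made-up name, the lead-byte range tried (0x81 to 0x9F and 0xE0 to 0xEF) is an assumption, and the exact result depends on the Shift-JIS table of the local iconv.

```shell
# list_5c_trail_bytes: print every two-byte Shift-JIS character whose second byte
# is 0x5c, one "LEAD 5C CHAR" line per character (hypothetical helper).
list_5c_trail_bytes() {
    lead=129                                   # 0x81, first lead byte tried
    while [ "$lead" -le 239 ]; do              # 0xEF, last lead byte tried
        # Skip 0xA0-0xDF: those are single-byte codes, not lead bytes.
        if [ "$lead" -lt 160 ] || [ "$lead" -gt 223 ]; then
            octal=$(printf '%03o' "$lead")
            # Decode LEAD followed by 0x5c (octal 134); iconv fails on unassigned pairs.
            if char=$(printf "\\${octal}\\134" | iconv -f SHIFT_JIS -t UTF-8 2>/dev/null); then
                printf '%02X 5C %s\n' "$lead" "$char"
            fi
        fi
        lead=$((lead + 1))
    done
}
```

Calling `list_5c_trail_bytes` prints one line per affected character, which can be compared against the table above.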

  8. I have a list of files and I want to write a script to extract all of them.

    My issue is that when I do

    find . -iname "*.zip" -print

    it yields NOTHING.

    any way to iterate through a list of zip files without using find?
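One way to loop over zip files without find is a plain shell glob. The sketch below uses a made-up helper, `for_each_zip`, that runs a given command once per zip file directly inside a folder; it assumes the archives are not in subfolders.

```shell
# for_each_zip DIR COMMAND...: run COMMAND once for each zip file directly inside DIR,
# with the archive path appended as the last argument (hypothetical helper).
for_each_zip() {
    dir=$1
    shift
    # [zZ][iI][pP] matches the extension in any letter case, like find's -iname.
    for archive in "$dir"/*.[zZ][iI][pP]; do
        # When nothing matches, the shell leaves the pattern unexpanded, so skip it.
        [ -f "$archive" ] || continue
        "$@" "$archive"
    done
}
```

For example, `for_each_zip . env LANG=ja_JP 7z x` would run the extraction command from the post on every zip file in the current folder.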
