How to extract a zip file containing SHIFT_JIS-encoded filenames in Ubuntu


If a zip file contains Japanese filenames, they are usually encoded in SHIFT_JIS, especially when the archive was created on Windows. The normal “unzip” will not extract such filenames correctly. 7z is a good solution.

The following commands are run in a terminal. First, we need to change the LANG environment variable, because the default LANG normally specifies a UTF-8 locale. Since the filenames are SHIFT_JIS, not UTF-8, we need to change it.

LANG=ja_JP # Don't use UTF-8, use "export" if needed

Then,

7z x jp.zip    #extract the files and preserve the encoding

As a result, files with unreadable names are extracted. Then use the convmv command to convert the filenames, assuming all the files are in the same folder.

convmv --notest -f shift-jis -t utf8 *.* # convert all the filenames to UTF-8
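
If convmv is not installed, the same renaming can be approximated with iconv and mv. This is only a minimal sketch, not how convmv itself works: `rename_sjis_to_utf8` is a hypothetical helper name, CP932 (the Windows variant of Shift-JIS) is an assumed source encoding, and the loop only handles entries directly inside one directory. It does not check whether a name is already UTF-8, so it should only be run on freshly extracted files.

```shell
#!/bin/sh
# Hypothetical helper: rename every entry directly inside directory $1,
# treating the current name as CP932 bytes and converting it to UTF-8.
rename_sjis_to_utf8() {
    dir=$1
    for old in "$dir"/*; do
        [ -e "$old" ] || continue          # an unmatched glob stays literal
        base=${old##*/}                    # strip the directory part
        # Decode the raw bytes as CP932; skip names iconv cannot decode.
        new=$(printf '%s' "$base" | iconv -f CP932 -t UTF-8) || continue
        [ "$new" = "$base" ] && continue   # pure ASCII names need no change
        mv -- "$old" "$dir/$new"
    done
}
```

Calling `rename_sjis_to_utf8 .` in the extraction folder would then rename the files in place.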

If we don’t know the character encoding of the filenames, we can also use iconv to check the encoding before extracting them:

env LANG=ja_JP 7z l jp.zip | iconv -f SHIFT-JIS -t UTF8
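
When the encoding is uncertain, one option is to try several candidates and see which ones decode the listing without errors. The following is a minimal sketch, assuming iconv supports the encodings named on the command line; `try_encodings` is a hypothetical helper, not part of 7z or iconv. A clean decode does not prove the encoding is right, so the converted listing still has to be checked by eye.

```shell
#!/bin/sh
# Hypothetical helper: read raw bytes on stdin and print each candidate
# encoding (given as arguments) that iconv can decode without an error.
try_encodings() {
    tmp=$(mktemp) || return 1
    cat > "$tmp"                           # keep stdin so it can be re-read
    for enc in "$@"; do
        if iconv -f "$enc" -t UTF-8 "$tmp" >/dev/null 2>&1; then
            echo "$enc"
        fi
    done
    rm -f "$tmp"
}
```

For example, `env LANG=ja_JP 7z l jp.zip | try_encodings UTF-8 CP932 EUC-JP` prints the candidates that decode cleanly; the candidate list here is only an illustration.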

Update (2011-05-16):
The easiest way is to use 7-Zip through Wine and install all the required fonts through winetricks.

About Allen Choong

A cognitive science student, a programmer, a philosopher, a Catholic.

20 responses »

  1. Please fix your article:

    “7z a jp.zip” doesn’t extract. Use the command “7z e” to extract.

  2. “7z e” extracts without preserving paths; “7z x” extracts while preserving paths.

    Similarly with convmv, you need the “-r” switch for it to work recursively through all directories. Using “*” instead of “*.*” is also a good idea: unlike on Windows, “*.*” doesn’t match files without an extension.

  3. Thank you very much for this! Broken encoding was driving me totally insane!

  4. unzip -O CP932 japanese_sjis.zip

    • This is Arch-only. It can be compiled for other Linux distros, but Info-ZIP is a real pain in the ass (I did it, though). It’s the best option here.

  5. Allemcch, thank you so much for posting this. I needed some Shift-JIS-encoded documents for work, and with your help I was able to read the filenames properly.

  6. it works like a charm!

  7. Pingback: Extracting files from zip which contains non-UTF8 filename in Linux | Allen's Blog 2.0

  8. Though this solution is excellent (including the ones in the comments with the sneaky -O option in unzip, neat!), I guess it will only work if you have this locale installed and/or created with locale.gen.

    Here, I have a Lubuntu installation which, for instance, does not even provide an ISO-8859-15 encoding by default: everything is UTF-8 only. Wait, how do I know that?
    By typing

    locale -a

  9. Hi Allen, Could you please give me the full command line using this statement?
    env LANG=ja_JP 7z l jp.zip | iconv -f SHIFT-JIS -t UTF8. You mention in your article that you can detect the encoding prior to extracting the file using 7-zip. I tried using 7za env LANG=ja_JP 7z l [file name].zip | iconv -f SHIFT-JIS -t UTF8 but that didn’t work. Thanks!

  10. Hey, have you checked out The Unarchiver? It’s the only utility I had zero problems with when extracting Japanese archives. Just works.

  11. Thanks, worked fine. I used -r flag for convmv to easily traverse all subdirectories of my extracted files.

  12. Pingback: How to run UTAU on Linux – Kanru Hua's Website

  13. Kungenoverallahar

    Thanks for this post! I had a couple of (seemingly) broken ZIP
    archives that I had almost given up on before I found this.

    Though, I’ve run into another possible problem that seems to be caused
    by the different path separator in Windows vs in the ZIP standard:
    Windows uses the byte 0x5c (“\”), whereas ZIP uses the ordinary 0x2f
    (“/”) like you’d see in other non-Windows file systems.

    It seems like some compression programs naively convert between the
    two separators by replacing every “\” in the filename with “/”. This
    might seem like a good idea if you live in a purely ASCII world, but
    it actually causes problems when compressing Shift-JIS encoded
    filenames.

    The problem is that the byte 0x5c occurs in some multi-byte characters
    in Shift-JIS. For example, ソ (Katakana letter “so”) is represented as
    0x83 0x5c, so a faulty compressor program would replace this by 0x83
    0x2f, which is not a valid code point.

    Even worse, attempting to extract an archive with such a file name
    does not just give it a malformed name, but something more bizarre.
    Suppose we start with the following file: ドレミハソラチド.png
    The program misinterprets it as containing a path separator and saves it
    as [ド] [レ] [ミ] [ハ] 0x83 [/] [ラ] [チ] [ド] [.] [p] [n] [g]
    (where [c] is the (correct) Shift-JIS representation of character c).
    Now, since it contains a valid ZIP path separator, extracting it will
    create the following:
    a) a directory called ドレミハ<0x83> (where <0x83> is the byte 0x83,
    which doesn’t stand for any character on its own and is most likely
    displayed as an empty box)
    b) a file called ラチド.png inside the above directory

    Since the actual file name has been broken up into several files (and
    the byte 0x5c removed), convmv cannot fix it properly. In order to do
    that, you have to first turn the directory/file names back into a
    single file name by reversing the above steps.

    I couldn’t figure out how to do this by simply renaming the files
    (since my shell escaped every character into a pure clusterfuck that I
    couldn’t get the heads or tail of, and mv simply refused to accept
    whatever file names I gave it) so I broke out a hex editor, patched
    every instance of 0x5c mistakenly converted to 0x2f in the file names,
    extracted it like normal and followed the steps in your post to
    convert them into UTF-8. Worked a treat!

  14. Kungenoverallahar

    I also put together a list of every character affected by the 0x5c => 0x2f conversion:

    |------------+-----------|
    | Code point | Character |
    |------------+-----------|
    | 81 5C      | ―         |
    | 83 5C      | ソ        |
    | 84 5C      | Ы         |
    | 89 5C      | 噂        |
    | 8A 5C      | 浬        |
    | 8B 5C      | 欺        |
    | 8C 5C      | 圭        |
    | 8D 5C      | 構        |
    | 8E 5C      | 蚕        |
    | 8F 5C      | 十        |
    | 90 5C      | 申        |
    | 91 5C      | 曾        |
    | 92 5C      | 箪        |
    | 93 5C      | 貼        |
    | 94 5C      | 能        |
    | 95 5C      | 表        |
    | 96 5C      | 暴        |
    | 97 5C      | 予        |
    | 98 5C      | 禄        |
    | 99 5C      | 兔        |
    | 9A 5C      | 喀        |
    | 9B 5C      | 媾        |
    | 9C 5C      | 彌        |
    | 9D 5C      | 拿        |
    | 9E 5C      | 杤        |
    | 9F 5C      | 歃        |
    | E0 5C      | 濬        |
    | E1 5C      | 畚        |
    | E2 5C      | 秉        |
    | E3 5C      | 綵        |
    | E4 5C      | 臀        |
    | E5 5C      | 藹        |
    | E6 5C      | 觸        |
    | E7 5C      | 軆        |
    | E8 5C      | 鐔        |
    | E9 5C      | 饅        |
    | EA 5C      | 鷭        |
    |------------+-----------|

    All of these will give you the same problem (outlined in my previous post).
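
The repair described in the last two comments can be sketched in shell, assuming iconv knows the CP932 encoding and the broken name was extracted as a directory plus a file, as in the ドレミハソラチド.png example. Both function names are hypothetical helpers, not part of 7z or convmv. `rejoin_split_name` glues a directory name and a file name back together with the lost 0x5c byte and prints the UTF-8 result; `list_5c_chars` prints the character that each given lead byte forms with 0x5c, which is one way to check a table like the one above against the local iconv.

```shell
#!/bin/sh
# Hypothetical helper: rebuild a name that was split at a 0x5c -> 0x2f
# substitution. $1 is the raw directory name (ending in the orphaned lead
# byte), $2 is the raw file name found inside that directory.
rejoin_split_name() {
    # \134 is the octal escape for byte 0x5c; the arguments go through %s
    # so printf does not interpret their bytes.
    printf '%s\134%s' "$1" "$2" | iconv -f CP932 -t UTF-8
}

# Hypothetical helper: for each lead byte (given in octal), print the
# character formed with trail byte 0x5c, skipping pairs iconv rejects.
list_5c_chars() {
    for lead in "$@"; do
        printf "\\${lead}\\134" | iconv -f CP932 -t UTF-8 2>/dev/null && echo
    done
}
```

The name printed by `rejoin_split_name "$dir" "$file"` can then be used as the target of `mv`, with `$dir` and `$file` holding the names exactly as found on disk. `list_5c_chars 203 225` checks the lead bytes 0x83 and 0x95 (written in octal).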
