regex ubuntu docker sed locale

How to use sed expression for substituting double width characters with single width


I want to replace certain double-width characters in a file with their single-width equivalents using a sed expression. The command below (from a bash script) does not work as expected, but it expresses what I want to do. I have mixed alphanumeric ranges with a few other characters I could think of offhand, and I'm not sure whether the ranges and the individual characters need to be split into separate -e arguments.

sed -e 's,[0-9a-zA-Z()【】-一],[0-9a-zA-Z\(\)\[\]\-\-],g' ${file} > ${file}.cleaned

The files are TSV (tab-separated values) text files. According to the file command, the type is "UTF-8 Unicode text, with CRLF line terminators" or, in another case, "UTF-8 Unicode text, with no line terminators".

Sample input:

Part Number
123-956-AA
343-213-【E】
XTE-898一(5)

Sample output:

Part Number
123-956-AA
343-213-[E]
XTE-898-(5)

My system is Ubuntu 16.04 running in a Docker container. The container is built from our base image, which is built from phusion/passenger-ruby23:0.9.19, which in turn is ultimately based on ubuntu:16.04. The shell is GNU bash, version 4.3.46(1)-release (x86_64-pc-linux-gnu), the sed version is sed (GNU sed) 4.2.2, and the output of the locale command is:

LANG=
LANGUAGE=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=
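Given that the fix described further down was to switch to a UTF-8 locale, a quick check of the active character map can help with diagnosis. This is a minimal sketch, assuming a system that provides the `locale` command; the helper name `is_utf8_locale` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: succeed only when the active locale's charmap is UTF-8.
is_utf8_locale() {
    [ "$(locale charmap 2>/dev/null)" = 'UTF-8' ]
}

if is_utf8_locale; then
    echo 'UTF-8 locale active'
else
    echo 'not a UTF-8 locale'
fi
```

With the POSIX settings listed above, the second branch would be expected to run.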

Update:

The chosen solution was 1) to use the y command (as the other answers also suggested) and, in my case, 2) to set LC_ALL as shown below to avoid the error I was getting with the y command. Ranges do not appear to work with the y command, so every character must be listed individually (contrary to what I previously thought).

LC_ALL=en_US.UTF-8 sed 'y/abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890()【】-一/abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890()[]--/' file.tsv
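The en_US.UTF-8 locale has to be installed for the LC_ALL setting to take effect, and a minimal container may only ship a different UTF-8 locale such as C.UTF-8. The following is a rough sketch, assuming GNU sed, GNU grep, and a `locale -a` that lists installed locales. The helper names `first_utf8_locale` and `to_single_width` are hypothetical, and only three characters are mapped to keep it short.

```shell
#!/bin/sh
# Hypothetical helper: print the first UTF-8 locale reported by `locale -a`,
# or nothing if none is installed.
first_utf8_locale() {
    locale -a 2>/dev/null | grep -i -m 1 -E 'utf-?8$'
}

# Hypothetical helper: run a short y command under that locale.
# Reads stdin and writes the translated text to stdout.
to_single_width() {
    loc=$(first_utf8_locale)
    if [ -z "$loc" ]; then
        echo 'no UTF-8 locale installed' >&2
        return 1
    fi
    # Each source character is listed individually; y has no range syntax.
    LC_ALL=$loc sed 'y/【】一/[]-/'
}
```

Usage would look like `to_single_width < file.tsv > file.tsv.cleaned`; the full character list from the command above can be substituted for the short one.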

Update 2:

Per the suggestion from the other answerers (one has mysteriously vanished), I looked further into setting the locale for the whole system instead of setting the environment variable on the command line. Since this is a Docker container environment, I found a solution to put into our base image, which solves the problem at the base system level.

I've added to our base Dockerfile:

# Set the locale
RUN locale-gen en_US.UTF-8
ENV LANG='en_US.UTF-8' LANGUAGE='en_US:en' LC_ALL='en_US.UTF-8'

and now the locale command produces:

LANG=en_US.UTF-8
LANGUAGE=en_US:en
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8

and now the sed command works as follows:

sed 'y/abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890()【】-一/abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890()[]--/' file.tsv

As a side note, I wish Stack Overflow provided a way to give answer credit to multiple answers, since the original three answers (again, one vanished) all contributed to my getting to the solution, but I had to choose only one. This happens often.


Solution

  • If perl is okay:

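    $ perl -Mopen=locale -Mutf8 -pe 'tr/0-9a-zA-Z()【】-一/0-9a-zA-Z()[]\-\-/' ip.txt
    Part Number
    123-956-AA
    343-213-[E]
    XTE-898-(5)
    
    • -Mopen=locale applies the locale's encoding to input and output, and -Mutf8 tells perl that the script text itself is UTF-8
    • tr/0-9a-zA-Z()【】-一/0-9a-zA-Z()[]\-\-/ translates the characters as required (the hyphens in the replacement list are escaped so they are not read as a range); y can be used instead of tr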


  • sed (GNU sed) 4.2.2 can also be used, but its y command doesn't support ranges, so each character has to be listed:

    $ # simulating OP's POSIX locale
    $ echo '91A9foo' | LC_ALL=C sed 'y/A9/A9/'
    sed: -e expression #1, char 12: strings for `y' command are different lengths
    
    $ # changing to a utf8 locale
    $ echo '91A9foo' | LC_ALL=en_US.UTF-8 sed 'y/A9/A9/'
    91A9foo
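If changing the locale is not an option, one s command per character is a possible alternative when only a handful of characters need fixing. A literal multi-byte character in an s pattern is matched as a plain byte sequence, so this does not depend on a UTF-8 locale. This is a sketch rather than part of the original answer; the helper name `fix_brackets_and_dash` is made up, and LC_ALL=C is set only to demonstrate the locale independence.

```shell
#!/bin/sh
# Hypothetical helper: replace three specific multi-byte characters with
# ASCII equivalents, using one s command per character and no bracket
# expressions. Reads stdin and writes the result to stdout.
fix_brackets_and_dash() {
    LC_ALL=C sed -e 's/【/[/g' -e 's/】/]/g' -e 's/一/-/g'
}

# Two of the sample rows from the question.
printf '%s\n' '343-213-【E】' 'XTE-898一(5)' | fix_brackets_and_dash
# prints:
# 343-213-[E]
# XTE-898-(5)
```

This gets unwieldy for the full alphanumeric set, where the y command under a UTF-8 locale (or perl's tr with ranges) is the shorter route.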
    

    Further reading: https://wiki.archlinux.org/index.php/locale