c++ string dictionary utf autocorrect

How to find a space in a Japanese string in C++?


I am working on an autocorrect program for Japanese sentences, in which a missing character in a sentence is represented as a space.

I am reading from 2 files...

Input file:

 はアビガイル
おはよう くん

Dictionary file:

私はアビガイル
おはよう花くん

The missing characters 私 and 花 are represented as a space

How can I find the space in a line from the input file?

I tried lineFromFile.find(" "), but it returns garbage because the text is not plain English characters. I also tried lineFromFile.find('\0x20') and lineFromFile.find(' ').

I also tried string lineFromFile = u8"あび", but the u8 prefix produces the error "identifier 'u8' is undefined".

I am using C++, Visual Studio 2013, gcc 4.8.3 and my current code page is Unicode (UTF-8 with signature)

If you think this is a duplicate question, please post a link to the ANSWERED question in a comment.

My plan is:

  1. Find the space from the line of the input file (return the spaceIndex)
  2. Save the line from the dictionary file in string temp
  3. Replace the character at spaceIndex in temp with a space
  4. Compare the line from the input file to temp
  5. Repeat until match is found or until eof of dictionary file

Please help, I have 3 days :'(


Solution

  • The missing characters 私 and 花 are represented as a space

    No they aren't. Looking at  はアビガイル in a hex editor shows that the first character is '\u3000' which is IDEOGRAPHIC SPACE not SPACE.

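    So to find it you need to use find(u8"\u3000") or find("\xe3\x80\x80"). The second form spells out the UTF-8 bytes directly and does not rely on the u8 prefix, so it also works with a compiler that rejects u8, as in the question.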
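As a minimal sketch of that search, assuming the line has already been read into a std::string holding UTF-8 bytes: the names kIdeographicSpace, findIdeographicSpace and checkExample are hypothetical, chosen only for illustration. The value returned is a byte offset, not a character index.

```cpp
#include <cassert>
#include <string>

// UTF-8 encoding of U+3000 IDEOGRAPHIC SPACE, written as raw bytes so that
// it compiles without the u8 prefix.
const std::string kIdeographicSpace = "\xe3\x80\x80";

// Returns the byte offset of the first ideographic space in a UTF-8 string,
// or std::string::npos if there is none. (Hypothetical helper name.)
std::string::size_type findIdeographicSpace(const std::string& line) {
    return line.find(kIdeographicSpace);
}

// Example check: the gap follows four hiragana of three bytes each,
// so it starts at byte offset 12.
void checkExample() {
    const std::string line = std::string("おはよう") + "\xe3\x80\x80" + "くん";
    assert(findIdeographicSpace(line) == 12);
}
```

The literal is split into separate pieces only to keep the hex escape visibly apart from the characters that follow it.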

    If you're lucky and all the Japanese characters in your input files are encoded as three bytes in UTF-8, then you can treat them as having fixed positions in the strings (character i starts at byte 3*i) and substitute blocks of three bytes from one string to another.
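The three-byte substitution could be sketched roughly as follows. This assumes every character in both files is exactly three bytes in UTF-8, that the dictionary has already been loaded into a vector, and that any byte order mark ("UTF-8 with signature") has been stripped from the first line. The names kGap and fillGap are hypothetical, not from the original post.

```cpp
#include <cassert>
#include <string>
#include <vector>

// UTF-8 bytes of U+3000 IDEOGRAPHIC SPACE, the marker for a missing character.
const std::string kGap = "\xe3\x80\x80";

// Tries each dictionary entry: copies the three bytes at the gap position
// from the entry into a copy of the input, then compares the result with
// the entry. Returns the matching entry, the input itself if it has no gap,
// or an empty string if nothing matches. (Hypothetical helper.)
std::string fillGap(const std::string& input,
                    const std::vector<std::string>& dictionary) {
    const std::string::size_type pos = input.find(kGap);
    if (pos == std::string::npos) return input;  // nothing is missing

    // Under the three-byte assumption the gap sits on a character boundary.
    assert(pos % kGap.size() == 0);

    for (const std::string& entry : dictionary) {
        if (entry.size() != input.size()) continue;  // lengths must agree
        std::string candidate = input;
        // Substitute the block of three bytes from the entry into the gap.
        candidate.replace(pos, kGap.size(), entry, pos, kGap.size());
        if (candidate == entry) return entry;
    }
    return std::string();
}
```

With the two dictionary lines from the question, the input line that starts with the gap would resolve to 私はアビガイル and the other to おはよう花くん. If the files can also contain ASCII or other characters that are not three bytes wide, the fixed-width assumption no longer holds and the lines would need to be decoded into code points first.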