Regex match of apostrophe in autohotkey script

I have an autohotkey script which looks up a word in a bilingual dictionary when I double click any word on a webpage. If I click on something like "l'homme" the l' is copied into the clipboard as well as the homme. I want the autohotkey script to strip out everything up to and including the apostrophe.

I can't get AutoHotkey to match the apostrophe. Below is a sample script which prints out the character codes of the first four characters. If I double click "l'homme" on this page, it prints out: 108, 8217, 104, 111. The second value is clearly not the ASCII code for an apostrophe (39). I think it most probably has something to do with the HTML representation of an apostrophe, but I haven't been able to get to the bottom of it. I've tried using AutoHotkey's Transform, HTML command without any luck.

I've tried both the Unicode and non-Unicode versions of autohotkey. I've saved the script in UTF-8.

#Persistent
return
OnClipboardChange:
;debugging info:
c1 := Asc(SubStr(clipboard,1,1))
c2 := Asc(SubStr(clipboard,2,1))
c3 := Asc(SubStr(clipboard,3,1))
c4 := Asc(SubStr(clipboard,4,1))
Msgbox 0,info, char1: %c1% `nchar2: %c2% `nchar3: %c3% `nchar4: %c4%

;the line below is what I want to use, but it doesn't find a match
stripToApostrophe := RegExReplace(clipboard, ".*’")

Solution

There is the standard apostrophe ' (U+0027, decimal 39) and there is the "curly" quote ’ (U+2019, decimal 8217, which is the value your script printed).

Your regex might have to be

.*['’]

to cover both cases.

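Maybe you'd like to make it non-greedy and anchor it at the start, too, if a word can have more than one apostrophe and you only want to remove everything up to the first one (RegExReplace replaces every match by default, so the non-greedy quantifier alone is not enough):

^.*?['’]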
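
To illustrate the difference on a word with two apostrophes, here is a small sketch. It assumes AutoHotkey v1.1 syntax and uses only the ASCII apostrophe, so the encoding of the script file does not matter; the variable names are made up for the example.

```autohotkey
; Sketch, assuming AutoHotkey v1.1. Variable names are illustrative only.
w := "l'homme d'affaires"

; Unanchored: RegExReplace replaces ALL matches, so "l'" and then "homme d'" are removed.
all := RegExReplace(w, ".*?'")                    ; affaires

; Anchored with ^: only the match at the start of the string is possible.
first := RegExReplace(w, "^.*?'")                 ; homme d'affaires

; Alternative: keep the pattern unanchored but pass Limit = 1 (fifth parameter).
limited := RegExReplace(w, ".*?'", "", count, 1)  ; homme d'affaires

MsgBox % all "`n" first "`n" limited
```

Either the anchor or the Limit parameter should do; the anchor is shorter when the pattern is only ever applied to a single word.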

EDIT:

Interesting. I tried this:

w1 := "l’homme"
w2 := "l'homme"
c1 := Asc(SubStr(w1,2,1))
c2 := Asc(SubStr(w2,2,1))
v1 := RegExReplace(w1, ".*?['’]")
v2 := RegExReplace(w2, ".*?['’]")
MsgBox 0,info, %c1% - %c2% - %v1% - %v2%
return

And got back 146 - 39 - homme - homme. I'm editing from Notepad. Is it possible that the ’ in our regex, which we think is character 8217, is actually stored as 146 (its code in the Windows ANSI code page) once we paste it into the script?
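
One way to test that suspicion is to compare the ’ typed into the script with one built from its code point. This is only a sketch: it assumes a Unicode build of AutoHotkey v1.1, and the variable names are made up. As far as I know, v1.1 reads a script file in the system ANSI code page unless the file starts with a byte order mark, so the literal may come out as something other than 8217 depending on how the file was saved.

```autohotkey
; Sketch, assuming a Unicode build of AutoHotkey v1.1.
fromCode := Chr(0x2019)  ; built from the code point, independent of the file encoding
literal  := "’"          ; depends on how the script file was saved and read

; If the two numbers differ, the literal in the script is not U+2019 at run time,
; and a regex containing it cannot match the character copied from the browser.
MsgBox % "Chr(0x2019): " Asc(fromCode) "`nliteral: " Asc(literal)
```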

EDIT:

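Apparently Unicode support is available only in AutoHotkey_L. Using it, I believe the correct regex should be either

".*?[\x{0027}\x{0092}\x{2019}]"

or, built by concatenation,

".*?(" Chr(0x0027) "|" Chr(0x0092) "|" Chr(0x2019) ")"

Here 0x0027 is the plain apostrophe (39), 0x0092 is the 146 seen above, and 0x2019 is the curly quote (8217).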
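
Putting this back into the script from the question, a minimal sketch might look like the following. It assumes a Unicode build of AutoHotkey v1.1 (AutoHotkey_L) and builds the character class from code points, so no non-ASCII character has to appear in the script file. The name `apostrophes` is made up for the example, and 0x92 is left out on the assumption that text copied from a browser arrives as U+2019.

```autohotkey
#Persistent
return

OnClipboardChange:
    ; A_EventInfo is 1 when the clipboard contains text (or files).
    if (A_EventInfo != 1)
        return
    ; Character class holding the plain apostrophe (U+0027)
    ; and the right single quotation mark (U+2019).
    apostrophes := "[" . Chr(0x27) . Chr(0x2019) . "]"
    ; ^ anchors the match, so only the part up to the first apostrophe is removed.
    stripToApostrophe := RegExReplace(Clipboard, "^.*?" . apostrophes)
    MsgBox, 0, info, %stripToApostrophe%
return
```

If the clipboard text contains no apostrophe at all, the pattern does not match and the word is passed through unchanged.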