
How to get substrings on both sides of hyphen and trailing substring?


I am currently working on a web app that uses a specific string to call a function. Here is a sample string:

$string = "translate from-to word for translate";

First I need to validate the string: it must have the same format as the $string above. How should I validate it?

Then I need to extract 3 substrings from $string.

  • The word that precedes the hyphen. (To be named: $source)
  • The word that follows the hyphen. (To be named: $target)
  • The text (not including the first space) that follows $target to the end of the string. (To be named: $translate)

This is my coding attempt to get the from and to:

$found = false;
$source ="";
$target = "";
$next = 3;
$prev = 1;
for($i=0;$i<strlen($string);$i++){
    if($found== false){
        if($string[$i] == "-"){
            $found = true;
            while($string[$i+$prev] != " "){
                $target .= $string[$i+$prev];
                $prev +=1;
            }
            /*$next -=1;
            while($string[$i-$next] != " " && $next > 0){
                $source .= $string[$i-$next];
                $next -=1;
            }*/
        }
    }
}

With that code, I can only get $target, which contains the "to" part after the hyphen.
I don't know how to get $source.

Please show me the fastest way to get the "from" part as $source and the "to" part as $target.

Then I need to get word for translate (all of the string after from-to).

So the result should be

$target = "to";
$source = "from";
$translate = "word for translate";

Finally, if $string contains more than one hyphen, like translate from-to from-to test-test word for translate, the result should be false.

Note: "to" and "from" are arbitrary strings.


Solution

  • Consider the following possible input strings:

    • translate from-to word for translate (1 hyphen, no accents or non-English characters)
    • translate dari-ke dari-ke word for translate (2 hyphens)
    • translate clé-solution word for translate (1 hyphen, accented character used)
    • translate goodbye-さようなら word for translate (1 hyphen, Japanese characters used)

    A case-insensitive pattern like: /^[a-z]+? ([a-z]+)-([a-z]+?) ([a-z ]+)$/i will perform as requested on the first two sample strings with high efficiency, but not the last two.
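To make the behaviour of this strict pattern concrete, here is a minimal sketch. The function name matchAsciiCommand() is hypothetical and the array keys are my own choice; only the pattern itself comes from the text above.

```php
<?php
/**
 * Hypothetical helper: applies the strict [a-z] pattern quoted above.
 * Returns the three substrings, or false when the string does not fit.
 */
function matchAsciiCommand(string $string)
{
    $pattern = '/^[a-z]+? ([a-z]+)-([a-z]+?) ([a-z ]+)$/i';
    if (!preg_match($pattern, $string, $out)) {
        return false;
    }
    return ['source' => $out[1], 'target' => $out[2], 'translate' => $out[3]];
}

var_export(matchAsciiCommand('translate from-to word for translate'));
echo "\n";
// false: the second hyphen cannot be matched by [a-z ]
var_export(matchAsciiCommand('translate dari-ke dari-ke word for translate'));
echo "\n";
// false: the bytes of é are outside [a-z]
var_export(matchAsciiCommand('translate clé-solution word for translate'));
echo "\n";
```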

    Using the "word character" (\w) to match the substrings (instead of the case-insensitive [a-z]) will also perform as intended on the first two samples, but it additionally allows 0-9 and _ as valid characters. This means a slight drop in pattern accuracy (which may be of no noticeable consequence to your project).
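The \w variant is described above but not written out, so the pattern below is my reading of that description rather than a quotation. The helper name fitsWordPattern() is hypothetical. The sketch only shows that digits and underscores now pass validation.

```php
<?php
// Assumed \w-based variant of the strict pattern (not quoted from the answer).
$wordPattern = '/^\w+? (\w+)-(\w+?) ([\w ]+)$/';

/** Hypothetical helper: true when $string fits the given pattern. */
function fitsWordPattern(string $string, string $pattern): bool
{
    return preg_match($pattern, $string) === 1;
}

// true: plain letters still match
var_export(fitsWordPattern('translate from-to word for translate', $wordPattern));
echo "\n";
// true: digits and underscores are accepted as well
var_export(fitsWordPattern('translate en_US-fr2 word 4 translate', $wordPattern));
echo "\n";
// false: a second hyphen is still rejected
var_export(fitsWordPattern('translate dari-ke dari-ke word for translate', $wordPattern));
echo "\n";
```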

    If you are translating strings that may go beyond English characters, it can be simpler and more forgiving to use a "negated character class" for matching. If you want to allow letters beyond a-z, such as accented and other multibyte characters, then [^-] offers a broad allowance of characters (at the expense of also allowing many unwanted characters, including digits, punctuation, and spaces).
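If the breadth of [^-] is a concern, one alternative that the answer itself does not use is PCRE's Unicode letter property. The sketch below assumes UTF-8 input and swaps in \p{L} (letters) plus \p{M} (combining marks) with the u modifier. The function name matchUnicodeCommand() is hypothetical.

```php
<?php
/**
 * Hypothetical helper: letters-only matching for any script.
 * Assumes the input is valid UTF-8; invalid UTF-8 also yields false.
 */
function matchUnicodeCommand(string $string)
{
    // \p{L} = any letter, \p{M} = combining marks; u = treat pattern and subject as UTF-8
    $pattern = '/^\p{L}+ ([\p{L}\p{M}]+)-([\p{L}\p{M}]+) ([\p{L}\p{M} ]+)$/u';
    if (!preg_match($pattern, $string, $out)) {
        return false;
    }
    return ['source' => $out[1], 'target' => $out[2], 'translate' => $out[3]];
}

var_export(matchUnicodeCommand('translate clé-solution word for translate'));
echo "\n";
var_export(matchUnicodeCommand('translate goodbye-さようなら word for translate'));
echo "\n";
// false: the digit in "t0" is not a letter
var_export(matchUnicodeCommand('translate from-t0 word for translate'));
echo "\n";
```

The trade-off is the opposite of [^-]: digits and punctuation are rejected everywhere, including in the trailing text.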

    It is important to only write "capture groups" for substrings that you want to subsequently use. For this reason, I do not capture the leading substring translate.

    list() is a handy "language construct" for assigning array values to named variables. Notice that the first element (the full-string match) is not assigned to a variable. This is why list()'s argument list starts with a comma. If you don't wish to leverage the convenience of list(), you can assign the three variables manually over three lines like this:

    $source=$out[1];
    $target=$out[2];
    $translate=$out[3];
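For comparison, here is a small sketch of the list() form next to the short array destructuring syntax, which is available from PHP 7.1. The $out array is hard-coded here to mimic the shape of a preg_match() result; its values are illustrative only.

```php
<?php
// Illustrative array shaped like a preg_match() result: [full match, group 1, group 2, group 3]
$out = ['translate from-to word for translate', 'from', 'to', 'word for translate'];

// The leading comma skips index 0 (the full-string match).
list(, $source, $target, $translate) = $out;

// Equivalent short syntax (PHP 7.1+); variable names differ only to keep both results visible.
[, $source2, $target2, $translate2] = $out;

echo "source=$source; target=$target; translate=$translate\n";
echo "source=$source2; target=$target2; translate=$translate2\n";
```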
    

    Code:

    $strings=[
        "translate from-to word for translate",
        "translate dari-ke dari-ke word for translate",
        "translate clé-solution word for translate",
        "translate goodbye-さようなら word for translate"
    ];
    
    foreach($strings as $string){
        if(preg_match('/^[a-z]+? ([^-]+)-([^-]+?) ([a-z ]+)$/i',$string,$out)){
            list(,$source,$target,$translate)=$out;
            echo "source=$source; target=$target; translate=$translate";
        }else{
            var_export(false);  // $found=false;
        }
        echo "<br>";
    }
    

    Output:

    source=from; target=to; translate=word for translate
    false
    source=clé; target=solution; translate=word for translate
    source=goodbye; target=さようなら; translate=word for translate
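Since the question asks for false on invalid input, the same pattern can be wrapped in a function. This is a sketch only: the name parseTranslateCommand() and the $strict flag are hypothetical. The flag swaps [^-] for [^- ], an assumption of mine, because [^-] also matches spaces and would accept a multi-word source.

```php
<?php
/**
 * Hypothetical wrapper: returns the three substrings, or false when the string does not fit.
 * $strict = true uses [^- ] (my assumption) so that source and target cannot contain spaces.
 */
function parseTranslateCommand(string $string, bool $strict = false)
{
    $pattern = $strict
        ? '/^[a-z]+? ([^- ]+)-([^- ]+?) ([a-z ]+)$/i'
        : '/^[a-z]+? ([^-]+)-([^-]+?) ([a-z ]+)$/i';  // pattern from the answer
    if (!preg_match($pattern, $string, $out)) {
        return false;
    }
    list(, $source, $target, $translate) = $out;
    return compact('source', 'target', 'translate');
}

var_export(parseTranslateCommand('translate from-to word for translate'));
echo "\n";
// With the answer's pattern, source becomes "new york" because [^-] matches the space.
var_export(parseTranslateCommand('translate new york-paris hello'));
echo "\n";
// The stricter variant rejects the same input.
var_export(parseTranslateCommand('translate new york-paris hello', true));
echo "\n";
```

Whether a multi-word source should be accepted depends on the application, so the stricter class is offered as an option, not a correction.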
    

    While regex provides a much more concise method with fewer function calls, this is a non-regex method:

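    if(substr_count($string,'-')!=1){
        var_export(false);  // $found=false;
    }else{
        // Split on spaces instead of using ltrim($string,'translate '):
        // ltrim()'s second argument is a character mask, so it would also eat
        // leading letters of a source such as "test".
        $array=explode(' ',$string,3);  // [command, "source-target", remaining text]
        list($source,$target)=explode('-',$array[1]);
        $translate=$array[2];
        echo "source=$source; target=$target; translate=$translate";
    }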
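A non-regex version can also be packaged as a function. The sketch below uses a hypothetical name, parseWithoutRegex(), and adds guards that are my own assumptions. It splits on spaces and does not strip the prefix with ltrim(), because ltrim()'s second argument is a set of characters to remove, not a literal prefix.

```php
<?php
/**
 * Hypothetical non-regex parser: returns the three substrings, or false on invalid input.
 */
function parseWithoutRegex(string $string)
{
    if (substr_count($string, '-') !== 1) {
        return false;  // zero hyphens, or more than one
    }
    $parts = explode(' ', $string, 3);  // [command, "source-target", remaining text]
    if (count($parts) < 3 || substr_count($parts[1], '-') !== 1) {
        return false;  // too few words, or the hyphen is not in the second word
    }
    list($source, $target) = explode('-', $parts[1]);
    return ['source' => $source, 'target' => $target, 'translate' => $parts[2]];
}

// Character-mask pitfall: every leading t, r, a, n, s, l, e and space is removed.
var_export(ltrim('translate test-test word', 'translate '));  // '-test word'
echo "\n";
var_export(parseWithoutRegex('translate test-test word'));
echo "\n";
```

Unlike the regex versions, this sketch does not check which characters appear in each part or that the first word is "translate".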