php preg-replace preg-match-all strip-tags array-unique

How do I remove duplicate links from a page, keeping only the first?


I have some content in which the same link appears again and again, so I want to remove all duplicate links and keep only one of each. Does anyone have an idea how to do this?

Here is my code, which removes all links:

function anchor_remover($page) {
    $filter_text = preg_replace("|<a *href=\"(.*)\">(.*)</a>|","\\2",$page); 
    return $filter_text; 
}

add_filter('the_content', 'anchor_remover');

Basically, I need this for WordPress: a filter on the content that removes duplicate links, so that each link appears only once.


Solution

  • Using preg_replace_callback:

    <?php
    /*
     * vim: ts=4 sw=4 fdm=marker noet
     */
    $page = file_get_contents('./dupes.html');
    
    function do_strip_link($matches)
    {
            static $seen = array();
    
            if( in_array($matches[1], $seen) )
            {
                    return $matches[2];
            }
            else
            {
                    $seen[] = $matches[1];
                    return $matches[0];
            }
    }
    function strip_dupe_links($page)
    {
            return preg_replace_callback(
                    '|<a\s+href="(.*?)">(.*?)</a>|',
                    'do_strip_link', // callback name must be a quoted string
                    $page
            );
    }
    
    $page = strip_dupe_links($page);
    echo $page;
    

    Input:

    <html>
            <head><title>Hi!</title></head>
            <body>
                    <a href="foo.html">foo</a>
                    <a href="foo.html">foo</a>
                    <a href="foo.html">foo</a>
                    <a href="foo.html">foo</a>
                    <a href="foo.html">foo</a>
                    <a href="foo.html">foo</a>
                    <a href="foo.html">foo</a>
                    <a href="foo.html">foo</a>
                    <a href="foo.html">foo</a>
                    <a href="foo.html">foo</a>
                    <a href="bar.html">bar</a>
            </body>
    </html>
    

    Output:

    <html>
            <head><title>Hi!</title></head>
            <body>
                    <a href="foo.html">foo</a>
                    foo
                    foo
                    foo
                    foo
                    foo
                    foo
                    foo
                    foo
                    foo
                    <a href="bar.html">bar</a>
            </body>
    </html>
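
Since the question is about WordPress, here is a minimal sketch of how the same idea might be hooked into `the_content`. One thing to watch for: the `static $seen` array above persists for the whole request, so if the filter runs for several posts on one page, a link seen in the first post would also be stripped from later posts. The sketch below assumes you want duplicates removed per post, so it resets the list on every call. The function name `strip_dupe_links_per_call` is made up for illustration, and `add_filter` is only called when WordPress has defined it. The pattern is the same one used above, so it still only matches anchors written exactly as `<a href="...">` with no other attributes.

```php
<?php
// Hypothetical helper name; not part of WordPress or of the answer above.
function strip_dupe_links_per_call($content)
{
    // A plain local array is reset on every call, unlike a static variable.
    $seen = array();

    return preg_replace_callback(
        '|<a\s+href="(.*?)">(.*?)</a>|',
        function ($matches) use (&$seen) {
            if (in_array($matches[1], $seen, true)) {
                // Duplicate href: keep only the link text.
                return $matches[2];
            }
            // First occurrence: remember the href and keep the whole anchor.
            $seen[] = $matches[1];
            return $matches[0];
        },
        $content
    );
}

// add_filter() only exists inside WordPress, so guard the call.
if (function_exists('add_filter')) {
    add_filter('the_content', 'strip_dupe_links_per_call');
}
```

Calling the function twice on the same string gives the same result both times, because nothing is remembered between calls.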