
Wikipedia-like list of all content pages


Wikipedia uses an "HTML sitemap" to link to every single content page. The huge number of pages has to be split into many groups so that each index page contains at most about 100 links.

This is how Wikipedia does it:

Special: All pages

The whole list of articles is divided into several larger groups, each defined by its first and last title:

  • "AAA rating" to "early adopter"
  • "earth" to "lamentation"
  • "low" to "priest"
  • ...

When you click one of these groups, its range (e.g. "earth" to "lamentation") is divided in the same way. This procedure is repeated until the current range includes only about 100 articles, so that they can be displayed directly.

I really like this approach to link lists, which minimizes the number of clicks needed to reach any article.

How can you create such an article list automatically?

So my question is how one could automatically create such an index page, which lets you click through to smaller and smaller groups until the number of articles in a group is small enough to display them.

Imagine an array of all article names is given. How would you start to program an index with automatic category splitting?

Array('AAA rating', 'abdicate', ..., 'zero', 'zoo')

It would be great if you could help me. I don't need a perfect solution but a useful approach, of course. Thank you very much in advance!

Edit: Found the part in Wikipedia's software (MediaWiki) now:

<?php
/**
 * Implements Special:Allpages
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License along
 * with this program; if not, write to the Free Software Foundation, Inc.,
 * 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
 * http://www.gnu.org/copyleft/gpl.html
 *
 * @file
 * @ingroup SpecialPage
 */

/**
 * Implements Special:Allpages
 *
 * @ingroup SpecialPage
 */
class SpecialAllpages extends IncludableSpecialPage {

    /**
     * Maximum number of pages to show on single subpage.
     */
    protected $maxPerPage = 345;

    /**
     * Maximum number of pages to show on single index subpage.
     */
    protected $maxLineCount = 100;

    /**
     * Maximum number of chars to show for an entry.
     */
    protected $maxPageLength = 70;

    /**
     * Determines, which message describes the input field 'nsfrom'.
     */
    protected $nsfromMsg = 'allpagesfrom';

    function __construct( $name = 'Allpages' ){
        parent::__construct( $name );
    }

    /**
     * Entry point : initialise variables and call subfunctions.
     *
     * @param $par String: becomes "FOO" when called like Special:Allpages/FOO (default NULL)
     */
    function execute( $par ) {
        global $wgRequest, $wgOut, $wgContLang;

        $this->setHeaders();
        $this->outputHeader();
        $wgOut->allowClickjacking();

        # GET values
        $from = $wgRequest->getVal( 'from', null );
        $to = $wgRequest->getVal( 'to', null );
        $namespace = $wgRequest->getInt( 'namespace' );

        $namespaces = $wgContLang->getNamespaces();

        $wgOut->setPagetitle( 
            ( $namespace > 0 && in_array( $namespace, array_keys( $namespaces) ) ) ?
            wfMsg( 'allinnamespace', str_replace( '_', ' ', $namespaces[$namespace] ) ) :
            wfMsg( 'allarticles' )
        );

        if( isset($par) ) {
            $this->showChunk( $namespace, $par, $to );
        } elseif( isset($from) && !isset($to) ) {
            $this->showChunk( $namespace, $from, $to );
        } else {
            $this->showToplevel( $namespace, $from, $to );
        }
    }

    /**
     * HTML for the top form
     *
     * @param $namespace Integer: a namespace constant (default NS_MAIN).
     * @param $from String: dbKey we are starting listing at.
     * @param $to String: dbKey we are ending listing at.
     */
    function namespaceForm( $namespace = NS_MAIN, $from = '', $to = '' ) {
        global $wgScript;
        $t = $this->getTitle();

        $out  = Xml::openElement( 'div', array( 'class' => 'namespaceoptions' ) );
        $out .= Xml::openElement( 'form', array( 'method' => 'get', 'action' => $wgScript ) );
        $out .= Html::hidden( 'title', $t->getPrefixedText() );
        $out .= Xml::openElement( 'fieldset' );
        $out .= Xml::element( 'legend', null, wfMsg( 'allpages' ) );
        $out .= Xml::openElement( 'table', array( 'id' => 'nsselect', 'class' => 'allpages' ) );
        $out .= "<tr>
    <td class='mw-label'>" .
            Xml::label( wfMsg( 'allpagesfrom' ), 'nsfrom' ) .
            "   </td>
    <td class='mw-input'>" .
            Xml::input( 'from', 30, str_replace('_',' ',$from), array( 'id' => 'nsfrom' ) ) .
            "   </td>
</tr>
<tr>
    <td class='mw-label'>" .
            Xml::label( wfMsg( 'allpagesto' ), 'nsto' ) .
            "   </td>
            <td class='mw-input'>" .
            Xml::input( 'to', 30, str_replace('_',' ',$to), array( 'id' => 'nsto' ) ) .
            "       </td>
</tr>
<tr>
    <td class='mw-label'>" .
            Xml::label( wfMsg( 'namespace' ), 'namespace' ) .
            "   </td>
            <td class='mw-input'>" .
            Xml::namespaceSelector( $namespace, null ) . ' ' .
            Xml::submitButton( wfMsg( 'allpagessubmit' ) ) .
            "   </td>
</tr>";
        $out .= Xml::closeElement( 'table' );
        $out .= Xml::closeElement( 'fieldset' );
        $out .= Xml::closeElement( 'form' );
        $out .= Xml::closeElement( 'div' );
        return $out;
    }

    /**
     * @param $namespace Integer (default NS_MAIN)
     * @param $from String: list all pages from this name
     * @param $to String: list all pages to this name
     */
    function showToplevel( $namespace = NS_MAIN, $from = '', $to = '' ) {
        global $wgOut;

        # TODO: Either make this *much* faster or cache the title index points
        # in the querycache table.

        $dbr = wfGetDB( DB_SLAVE );
        $out = "";
        $where = array( 'page_namespace' => $namespace );

        $from = Title::makeTitleSafe( $namespace, $from );
        $to = Title::makeTitleSafe( $namespace, $to );
        $from = ( $from && $from->isLocal() ) ? $from->getDBkey() : null;
        $to = ( $to && $to->isLocal() ) ? $to->getDBkey() : null;

        if( isset($from) )
            $where[] = 'page_title >= '.$dbr->addQuotes( $from );
        if( isset($to) )
            $where[] = 'page_title <= '.$dbr->addQuotes( $to );

        global $wgMemc;
        $key = wfMemcKey( 'allpages', 'ns', $namespace, $from, $to );
        $lines = $wgMemc->get( $key );

        $count = $dbr->estimateRowCount( 'page', '*', $where, __METHOD__ );
        $maxPerSubpage = intval($count/$this->maxLineCount);
        $maxPerSubpage = max($maxPerSubpage,$this->maxPerPage);

        if( !is_array( $lines ) ) {
            $options = array( 'LIMIT' => 1 );
            $options['ORDER BY'] = 'page_title ASC';
            $firstTitle = $dbr->selectField( 'page', 'page_title', $where, __METHOD__, $options );
            $lastTitle = $firstTitle;
            # This array is going to hold the page_titles in order.
            $lines = array( $firstTitle );
            # If we are going to show n rows, we need n+1 queries to find the relevant titles.
            $done = false;
            while( !$done ) {
                // Fetch the last title of this chunk and the first of the next
                $chunk = ( $lastTitle === false )
                    ? array()
                    : array( 'page_title >= ' . $dbr->addQuotes( $lastTitle ) );
                $res = $dbr->select( 'page', /* FROM */
                    'page_title', /* WHAT */
                    array_merge($where,$chunk),
                    __METHOD__,
                    array ('LIMIT' => 2, 'OFFSET' => $maxPerSubpage - 1, 'ORDER BY' => 'page_title ASC')
                );

                $s = $dbr->fetchObject( $res );
                if( $s ) {
                    array_push( $lines, $s->page_title );
                } else {
                    // Final chunk, but ended prematurely. Go back and find the end.
                    $endTitle = $dbr->selectField( 'page', 'MAX(page_title)',
                        array_merge($where,$chunk),
                        __METHOD__ );
                    array_push( $lines, $endTitle );
                    $done = true;
                }
                $s = $res->fetchObject();
                if( $s ) {
                    array_push( $lines, $s->page_title );
                    $lastTitle = $s->page_title;
                } else {
                    // This was a final chunk and ended exactly at the limit.
                    // Rare but convenient!
                    $done = true;
                }
                $res->free();
            }
            $wgMemc->add( $key, $lines, 3600 );
        }

        // If there are only two or less sections, don't even display them.
        // Instead, display the first section directly.
        if( count( $lines ) <= 2 ) {
            if( !empty($lines) ) {
                $this->showChunk( $namespace, $from, $to );
            } else {
                $wgOut->addHTML( $this->namespaceForm( $namespace, $from, $to ) );
            }
            return;
        }

        # At this point, $lines should contain an even number of elements.
        $out .= Xml::openElement( 'table', array( 'class' => 'allpageslist' ) );
        while( count ( $lines ) > 0 ) {
            $inpoint = array_shift( $lines );
            $outpoint = array_shift( $lines );
            $out .= $this->showline( $inpoint, $outpoint, $namespace );
        }
        $out .= Xml::closeElement( 'table' );
        $nsForm = $this->namespaceForm( $namespace, $from, $to );

        # Is there more?
        if( $this->including() ) {
            $out2 = '';
        } else {
            if( isset($from) || isset($to) ) {
                global $wgUser;
                $out2 = Xml::openElement( 'table', array( 'class' => 'mw-allpages-table-form' ) ).
                        '<tr>
                            <td>' .
                                $nsForm .
                            '</td>
                            <td class="mw-allpages-nav">' .
                                $wgUser->getSkin()->link( $this->getTitle(), wfMsgHtml ( 'allpages' ),
                                    array(), array(), 'known' ) .
                            "</td>
                        </tr>" .
                    Xml::closeElement( 'table' );
            } else {
                $out2 = $nsForm;
            }
        }
        $wgOut->addHTML( $out2 . $out );
    }

    /**
     * Show a line of "ABC to DEF" ranges of articles
     *
     * @param $inpoint String: lower limit of pagenames
     * @param $outpoint String: upper limit of pagenames
     * @param $namespace Integer (Default NS_MAIN)
     */
    function showline( $inpoint, $outpoint, $namespace = NS_MAIN ) {
        global $wgContLang;
        $inpointf = htmlspecialchars( str_replace( '_', ' ', $inpoint ) );
        $outpointf = htmlspecialchars( str_replace( '_', ' ', $outpoint ) );
        // Don't let the length runaway
        $inpointf = $wgContLang->truncate( $inpointf, $this->maxPageLength );
        $outpointf = $wgContLang->truncate( $outpointf, $this->maxPageLength );

        $queryparams = $namespace ? "namespace=$namespace&" : '';
        $special = $this->getTitle();
        $link = $special->escapeLocalUrl( $queryparams . 'from=' . urlencode($inpoint) . '&to=' . urlencode($outpoint) );

        $out = wfMsgHtml( 'alphaindexline',
            "<a href=\"$link\">$inpointf</a></td><td>",
            "</td><td><a href=\"$link\">$outpointf</a>"
        );
        return '<tr><td class="mw-allpages-alphaindexline">' . $out . '</td></tr>';
    }

    /**
     * @param $namespace Integer (Default NS_MAIN)
     * @param $from String: list all pages from this name (default FALSE)
     * @param $to String: list all pages to this name (default FALSE)
     */
    function showChunk( $namespace = NS_MAIN, $from = false, $to = false ) {
        global $wgOut, $wgUser, $wgContLang, $wgLang;

        $sk = $wgUser->getSkin();

        $fromList = $this->getNamespaceKeyAndText($namespace, $from);
        $toList = $this->getNamespaceKeyAndText( $namespace, $to );
        $namespaces = $wgContLang->getNamespaces();
        $n = 0;

        if ( !$fromList || !$toList ) {
            $out = wfMsgWikiHtml( 'allpagesbadtitle' );
        } elseif ( !in_array( $namespace, array_keys( $namespaces ) ) ) {
            // Show errormessage and reset to NS_MAIN
            $out = wfMsgExt( 'allpages-bad-ns', array( 'parseinline' ), $namespace );
            $namespace = NS_MAIN;
        } else {
            list( $namespace, $fromKey, $from ) = $fromList;
            list( , $toKey, $to ) = $toList;

            $dbr = wfGetDB( DB_SLAVE );
            $conds = array(
                'page_namespace' => $namespace,
                'page_title >= ' . $dbr->addQuotes( $fromKey )
            );
            if( $toKey !== "" ) {
                $conds[] = 'page_title <= ' . $dbr->addQuotes( $toKey );
            }

            $res = $dbr->select( 'page',
                array( 'page_namespace', 'page_title', 'page_is_redirect' ),
                $conds,
                __METHOD__,
                array(
                    'ORDER BY'  => 'page_title',
                    'LIMIT'     => $this->maxPerPage + 1,
                    'USE INDEX' => 'name_title',
                )
            );

            if( $res->numRows() > 0 ) {
                $out = Xml::openElement( 'table', array( 'class' => 'mw-allpages-table-chunk' ) );
                while( ( $n < $this->maxPerPage ) && ( $s = $res->fetchObject() ) ) {
                    $t = Title::makeTitle( $s->page_namespace, $s->page_title );
                    if( $t ) {
                        $link = ( $s->page_is_redirect ? '<div class="allpagesredirect">' : '' ) .
                            $sk->linkKnown( $t, htmlspecialchars( $t->getText() ) ) .
                            ($s->page_is_redirect ? '</div>' : '' );
                    } else {
                        $link = '[[' . htmlspecialchars( $s->page_title ) . ']]';
                    }
                    if( $n % 3 == 0 ) {
                        $out .= '<tr>';
                    }
                    $out .= "<td style=\"width:33%\">$link</td>";
                    $n++;
                    if( $n % 3 == 0 ) {
                        $out .= "</tr>\n";
                    }
                }
                if( ($n % 3) != 0 ) {
                    $out .= "</tr>\n";
                }
                $out .= Xml::closeElement( 'table' );
            } else {
                $out = '';
            }
        }

        if ( $this->including() ) {
            $out2 = '';
        } else {
            if( $from == '' ) {
                // First chunk; no previous link.
                $prevTitle = null;
            } else {
                # Get the last title from previous chunk
                $dbr = wfGetDB( DB_SLAVE );
                $res_prev = $dbr->select(
                    'page',
                    'page_title',
                    array( 'page_namespace' => $namespace, 'page_title < '.$dbr->addQuotes($from) ),
                    __METHOD__,
                    array( 'ORDER BY' => 'page_title DESC', 
                        'LIMIT' => $this->maxPerPage, 'OFFSET' => ($this->maxPerPage - 1 )
                    )
                );

                # Get first title of previous complete chunk
                if( $dbr->numrows( $res_prev ) >= $this->maxPerPage ) {
                    $pt = $dbr->fetchObject( $res_prev );
                    $prevTitle = Title::makeTitle( $namespace, $pt->page_title );
                } else {
                    # The previous chunk is not complete, need to link to the very first title
                    # available in the database
                    $options = array( 'LIMIT' => 1 );
                    if ( ! $dbr->implicitOrderby() ) {
                        $options['ORDER BY'] = 'page_title';
                    }
                    $reallyFirstPage_title = $dbr->selectField( 'page', 'page_title',
                        array( 'page_namespace' => $namespace ), __METHOD__, $options );
                    # Show the previous link if it s not the current requested chunk
                    if( $from != $reallyFirstPage_title ) {
                        $prevTitle =  Title::makeTitle( $namespace, $reallyFirstPage_title );
                    } else {
                        $prevTitle = null;
                    }
                }
            }

            $self = $this->getTitle();

            $nsForm = $this->namespaceForm( $namespace, $from, $to );
            $out2 = Xml::openElement( 'table', array( 'class' => 'mw-allpages-table-form' ) ).
                        '<tr>
                            <td>' .
                                $nsForm .
                            '</td>
                            <td class="mw-allpages-nav">' .
                                $sk->link( $self, wfMsgHtml ( 'allpages' ), array(), array(), 'known' );

            # Do we put a previous link ?
            if( isset( $prevTitle ) &&  $pt = $prevTitle->getText() ) {
                $query = array( 'from' => $prevTitle->getText() );

                if( $namespace )
                    $query['namespace'] = $namespace;

                $prevLink = $sk->linkKnown(
                    $self,
                    htmlspecialchars( wfMsg( 'prevpage', $pt ) ),
                    array(),
                    $query
                );
                $out2 = $wgLang->pipeList( array( $out2, $prevLink ) );
            }

            if( $n == $this->maxPerPage && $s = $res->fetchObject() ) {
                # $s is the first link of the next chunk
                $t = Title::MakeTitle($namespace, $s->page_title);
                $query = array( 'from' => $t->getText() );

                if( $namespace )
                    $query['namespace'] = $namespace;

                $nextLink = $sk->linkKnown(
                    $self,
                    htmlspecialchars( wfMsg( 'nextpage', $t->getText() ) ),
                    array(),
                    $query
                );
                $out2 = $wgLang->pipeList( array( $out2, $nextLink ) );
            }
            $out2 .= "</td></tr></table>";
        }

        $wgOut->addHTML( $out2 . $out );
        if( isset($prevLink) or isset($nextLink) ) {
            $wgOut->addHTML( '<hr /><p class="mw-allpages-nav">' );
            if( isset( $prevLink ) ) {
                $wgOut->addHTML( $prevLink );
            }
            if( isset( $prevLink ) && isset( $nextLink ) ) {
                $wgOut->addHTML( wfMsgExt( 'pipe-separator' , 'escapenoentities' ) );
            }
            if( isset( $nextLink ) ) {
                $wgOut->addHTML( $nextLink );
            }
            $wgOut->addHTML( '</p>' );

        }

    }

    /**
     * @param $ns Integer: the namespace of the article
     * @param $text String: the name of the article
     * @return array( int namespace, string dbkey, string pagename ) or NULL on error
     * @static (sort of)
     * @access private
     */
    function getNamespaceKeyAndText($ns, $text) {
        if ( $text == '' )
            return array( $ns, '', '' ); # shortcut for common case

        $t = Title::makeTitleSafe($ns, $text);
        if ( $t && $t->isLocal() ) {
            return array( $t->getNamespace(), $t->getDBkey(), $t->getText() );
        } else if ( $t ) {
            return null;
        }

        # try again, in case the problem was an empty pagename
        $text = preg_replace('/(#|$)/', 'X$1', $text);
        $t = Title::makeTitleSafe($ns, $text);
        if ( $t && $t->isLocal() ) {
            return array( $t->getNamespace(), '', '' );
        } else {
            return null;
        }
    }
}
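
As a rough in-memory analogue of the boundary search in showToplevel() above, the same chunking can be sketched on a sorted array instead of the page table. The function name index_boundaries() and its parameters are hypothetical; only the sizing rule and the alternating layout of $lines are taken from the code above.

```php
<?php
/**
 * Hypothetical in-memory analogue of the boundary search in showToplevel().
 * $titles must already be sorted. The result alternates the first and last
 * title of each chunk, like the $lines array above.
 */
function index_boundaries( array $titles, int $maxLineCount = 100, int $maxPerPage = 345 ): array {
    $count = count( $titles );
    if ( $count === 0 ) {
        return array();
    }
    // Same sizing rule as above: aim for roughly $maxLineCount lines, but
    // never make a chunk smaller than one full listing page.
    $maxPerSubpage = max( intdiv( $count, $maxLineCount ), $maxPerPage );

    $lines = array();
    for ( $i = 0; $i < $count; $i += $maxPerSubpage ) {
        $last = min( $i + $maxPerSubpage, $count ) - 1;
        $lines[] = $titles[$i];    // inpoint: first title of the chunk
        $lines[] = $titles[$last]; // outpoint: last title of the chunk
    }
    return $lines;
}
```

Each consecutive pair in the returned list corresponds to one "from ... to ..." row. A list with two or fewer entries corresponds to the case where the pages are listed directly instead.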

Solution

  • Not a great approach, as you don't have a way of stopping when you get to the end of the list. You only want to split the items if there are more items than your maximum (although you may want to add some flexibility there, as otherwise you could end up with only two items on a page).

    I assume that the data would actually come from a database, but I am using your $items array for ease of display.

    At its simplest, assuming the request comes from a web page that sends the index numbers of the start and end, and that you have checked that those numbers are valid and sanitised:

    $itemsPerPage = 50; // constant: maximum number of links per page
    // Round up to a whole number so the step can be used with array indexes
    $itemStep = (int) ceil(($end - $start) / $itemsPerPage);

    if($itemStep <= 1)
    {
        for($i = $start; $i < $end; $i++)
        {
            // display these as individual items
            display_link($items[$i]);
        }
    }
    else
    {
        for($i = $start; $i < $end; $i += $itemStep)
        {
            $to = $i + ($itemStep - 1); // find the end part
            if($to >= $end)
                $to = $end - 1; // $end is exclusive, so clamp to the last valid index
            display_to_from($items[$i], $items[$to]);
        }
    }
    

    where the display functions display the links as you want. However, one problem with doing it like that is that you may want to adjust the items per page, as you run the risk of having a set of (say) 51 items and ending up with one link from 1 to 49 and another from 50 to 51.
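
    One way to avoid such lopsided groups is to fix the number of groups first and spread the items evenly across them. This is only a minimal sketch, assuming index-based ranges with an exclusive $end; balanced_ranges() and its parameters are hypothetical names, not part of the code above:

    ```php
    <?php
    // Hypothetical helper: split the index range [$start, $end) into $parts
    // sub-ranges whose sizes differ by at most one item.
    function balanced_ranges(int $start, int $end, int $parts): array
    {
        $count = $end - $start;
        if ($count <= 0 || $parts <= 0) {
            return [];
        }
        $parts = min($parts, $count);    // never more groups than items
        $base  = intdiv($count, $parts); // minimum size of every group
        $extra = $count % $parts;        // the first $extra groups get one more

        $ranges = [];
        $from = $start;
        for ($p = 0; $p < $parts; $p++) {
            $size = $base + ($p < $extra ? 1 : 0);
            $ranges[] = [$from, $from + $size - 1]; // inclusive first/last index
            $from += $size;
        }
        return $ranges;
    }
    ```

    For 51 items and two groups this yields one range of 26 items and one of 25.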

    I don't understand why you are arranging it in groups in your pseudocode. You are going from page to page doing further chops, so you only need the start and end of each section until you get to the page where all the links will fit.

    -- edit

    The original was wrong. Now you divide the number of items you have to go through by the maximum number of items you want to display. If there are 1,000 items, this will list every 20th item; if there are 100,000, it will be every 2,000th. If there are fewer items than the number you show, you can show them all individually.
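
    Applying that division repeatedly also shows how many index pages a reader has to click through. A small sketch, assuming the step is always rounded up to a whole number; index_levels() is a hypothetical helper name:

    ```php
    <?php
    // Hypothetical helper: number of range pages a reader clicks through
    // before reaching a page that lists individual articles.
    function index_levels(int $totalItems, int $itemsPerPage): int
    {
        if ($itemsPerPage < 2) {
            throw new InvalidArgumentException('itemsPerPage must be at least 2');
        }
        $levels = 0;
        $remaining = $totalItems;
        while ($remaining > $itemsPerPage) {
            // One click narrows the range to a single sub-range of this size
            // (integer ceiling of $remaining / $itemsPerPage).
            $remaining = intdiv($remaining + $itemsPerPage - 1, $itemsPerPage);
            $levels++;
        }
        return $levels;
    }
    ```

    With 50 links per page, 1,000 items need one level of ranges, 100,000 need two, and 2,000,000 need three.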

    -- edit again - to add some more about the database

    No, you are right, you don't want to load 2,000,000 data records, and you don't have to. You have two options: you can make a prepared statement such as "select * from articles where article = ?" and loop through the results, getting one at a time, or, if you want to do it in one block, assuming a MySQL database and the code above:

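    // Assumes $itemStep is a whole number greater than 1, as in the code above
    $idList = "";
    for($i = $start; $i < $end; $i += $itemStep)
    {
        $to = $i + ($itemStep - 1); // find the end part
        if($to >= $end)
            $to = $end - 1; // $end is exclusive
        // display_to_from($items[$i], $items[$to]);
        if($i != $start)
            $idList .= ", "; // .= concatenates; += would do numeric addition
        $idList .= $i.", ".$to;
    }
    // order by article_id so that start and end rows come back in sequence
    $sqlQuery = "Select * from articles where article_id in (".$idList.") order by article_id";
    // ... do the mysql select and go through the results, using alternate rows as the start and end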
    

    This gives you a query like 'Select * from articles where article_id in (1,49,50,99,100,149... etc)'
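
    The same query can also be built with placeholders rather than concatenated values. This is a sketch only: build_boundary_query() is a hypothetical helper, and the table and column names are simply the ones used above.

    ```php
    <?php
    // Hypothetical helper: build the IN (...) query with one placeholder per id.
    // Returns [sql, params]; sql is null when there are no ids.
    function build_boundary_query(array $ids): array
    {
        // Cast to int, drop duplicates and sort, so params match row order.
        $ids = array_values(array_unique(array_map('intval', $ids)));
        sort($ids);
        if ($ids === []) {
            return [null, []];
        }
        $placeholders = implode(', ', array_fill(0, count($ids), '?'));
        $sql = "SELECT * FROM articles WHERE article_id IN ($placeholders) ORDER BY article_id";
        return [$sql, $ids];
    }
    ```

    Assuming a PDO connection in $pdo, the result could be run with $pdo->prepare($sql) followed by execute($params) on the returned statement.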

    Then process that as a normal result set.
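
    Pairing the fetched rows back into ranges could look like the sketch below. It looks titles up by id instead of relying on strictly alternating rows, since a range whose start and end are the same id comes back as a single row. pair_boundaries() and the title column are assumptions for illustration.

    ```php
    <?php
    // Hypothetical helper: turn fetched boundary rows into [fromTitle, toTitle] pairs.
    // $rows:   list of ['article_id' => int, 'title' => string] (column names assumed)
    // $ranges: list of [firstId, lastId] pairs that were put into the IN (...) list
    function pair_boundaries(array $rows, array $ranges): array
    {
        // Map article_id => title for direct lookup.
        $titleById = array_column($rows, 'title', 'article_id');
        $pairs = [];
        foreach ($ranges as [$firstId, $lastId]) {
            // Skip ranges whose boundary rows were not returned.
            if (isset($titleById[$firstId], $titleById[$lastId])) {
                $pairs[] = [$titleById[$firstId], $titleById[$lastId]];
            }
        }
        return $pairs;
    }
    ```

    Each resulting pair can then be handed to display_to_from().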