Offline Wikipedia with Elasticsearch and MediaWiki

Wikipedia is awesome! It's open, it's free, and it's huge: millions of articles. But as a developer, how do you exploit all that free knowledge?

I started digging around the internet for ways to use my fresh 9+ GB gzipped XML archive, which seemed useless to me as it stood, since even a simple text editor can't open it. (I was mostly just excited to see what's inside and how it's structured: the schema!)
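A streaming parser is one way to look inside without loading the whole file into memory. The following is a minimal sketch in Python, assuming the archive is a gzipped MediaWiki XML export whose pages carry <title> and <text> elements. The function names and the tiny sample document are made up for illustration.

```python
import gzip
import os
import tempfile
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the XML namespace prefix: '{uri}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(dump_path, limit=None):
    """Yield (title, wikitext) pairs from a gzipped MediaWiki XML dump."""
    count = 0
    with gzip.open(dump_path, "rb") as stream:
        context = ET.iterparse(stream, events=("start", "end"))
        _, root = next(context)  # the first event is the start of the root element
        title, text = None, ""
        for event, elem in context:
            if event != "end":
                continue
            name = local_name(elem.tag)
            if name == "title":
                title = elem.text
            elif name == "text":
                text = elem.text or ""
            elif name == "page":
                yield title, text
                title, text = None, ""
                root.clear()  # drop finished pages so memory use stays flat
                count += 1
                if limit is not None and count >= limit:
                    return


if __name__ == "__main__":
    # Tiny made-up sample so the sketch runs without the real archive.
    sample = (
        '<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">'
        "<page><title>Example</title>"
        "<revision><text>'''Example''' is a [[sample]] page.</text></revision>"
        "</page></mediawiki>"
    )
    path = os.path.join(tempfile.mkdtemp(), "sample.xml.gz")
    with gzip.open(path, "wb") as f:
        f.write(sample.encode("utf-8"))
    for title, text in iter_pages(path, limit=5):
        print(title, "|", text[:80])
```

Pointing iter_pages at the real archive with a small limit should be enough to see how titles and raw wikitext are laid out.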

Luckily, people have already written importers for it. Elasticsearch is fast, reliable, and good at searching, so https://github.com/andrewvc/wikiparse was a lifesaver.

  • Installed Elasticsearch
  • Ran the import command

It took almost 48 hours on an i5 with 8 GB of RAM. My mistake was using the same hard disk for both the source data and the database, so your time might vary.
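Once the import finishes, a quick search is an easy way to confirm the articles are really there. Here is a rough sketch against Elasticsearch's _search endpoint with a match query. It assumes Elasticsearch is listening on its default port 9200. The index name and field name are guesses, so check what the importer actually created.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # Elasticsearch's default HTTP port
INDEX = "en-wikipedia"            # assumed index name; verify on your install
FIELD = "title"                   # assumed field name; verify on your install


def build_search_request(phrase, size=5):
    """Build a POST request for the _search endpoint with a match query."""
    body = {"size": size, "query": {"match": {FIELD: phrase}}}
    return urllib.request.Request(
        f"{ES_URL}/{INDEX}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_hits(response_json):
    """Return the stored documents (_source) from a search response."""
    return [hit["_source"] for hit in response_json["hits"]["hits"]]


def search(phrase, size=5):
    """Send the query to the running Elasticsearch node and return the hits."""
    with urllib.request.urlopen(build_search_request(phrase, size)) as resp:
        return extract_hits(json.load(resp))
```

Calling search("Elasticsearch") with the node running returns a list of matching documents, each still holding raw wikitext.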

The data was imported, but it was still of no use! Why? It's in wikitext format, so a parser is needed.

After some searching, the only solution I found was the MediaWiki API, which is written in PHP. Lots of things were missing for my purpose, since it is built for MediaWiki itself and not for parsing plain text. (Though I didn't spend much time learning the internal API.)

I quickly downloaded MediaWiki, ran nginx with PHP, installed it, and used api.php.
It was good to see my own offline API, but many things were still missing or confusing, and the API's structure is hard to modify. So I created a parse.php:

<?php
// Allow cross-origin requests so a browser front end can call this endpoint.
header('Access-Control-Allow-Origin: *');
header('Access-Control-Allow-Methods: GET, POST');
header('Access-Control-Allow-Headers: Origin, X-Requested-With, Content-Type, Accept, Key');
header('Content-Type: application/json');

// Hook handler: blank out the "[edit]" section links, which are useless offline.
// The first parameter must not be named $this (a fatal error in PHP 7.1+).
function fixLink($skin, $nt, $section, $tooltip, &$result) {
	$result = '';
	return true;
}

// Bootstrap MediaWiki. This file sits next to the includes/ directory.
require(dirname(__FILE__) . '/includes/WebStart.php');

$wgHooks['DoEditSectionLink'][] = 'fixLink';

//include("api.php");

// Parse the posted wikitext into HTML; treat a missing field as empty text.
$text = isset($_POST['text']) ? $_POST['text'] : '';
$output = $wgParser->parse(
    $text,
    Title::newFromText('Some page title'),
    new ParserOptions());

echo json_encode(array('html' => $output->getText()));
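
To show how a client might talk to this endpoint, here is a small sketch that posts wikitext as the form field text and reads the html key from the JSON reply, matching what parse.php above reads and returns. The URL is an assumption and depends on where your nginx serves MediaWiki.

```python
import json
import urllib.parse
import urllib.request

PARSE_URL = "http://localhost/parse.php"  # assumed location; adjust to your setup


def build_parse_request(wikitext, url=PARSE_URL):
    """Build a form-encoded POST carrying the wikitext in the 'text' field."""
    data = urllib.parse.urlencode({"text": wikitext}).encode("utf-8")
    return urllib.request.Request(url, data=data, method="POST")


def wikitext_to_html(wikitext, url=PARSE_URL):
    """Send wikitext to parse.php and return the rendered HTML string."""
    with urllib.request.urlopen(build_parse_request(wikitext, url)) as resp:
        return json.load(resp)["html"]
```

With the server running, wikitext_to_html("'''bold''' and [[Link]]") should come back as an HTML fragment rendered by MediaWiki's parser.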

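So all the steps were:

  • Install Elasticsearch
  • Import the Wikipedia dump with wikiparse
  • Install MediaWiki behind nginx with PHP
  • Add parse.php to the MediaWiki directory
  • POST the stored wikitext to parse.php and get HTML back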
