Wikipedia is awesome! It's open, it's free, and it's huge: millions of articles. But as a developer, how do you exploit all that free knowledge?
I started digging around the internet for ways to use my fresh 9+ GB gzipped XML archive, which seemed useless to me, since even a simple text editor can't open it. (I was just excited to see what's inside: how it's structured, the schema!)
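If you only want a peek at the structure without importing anything, streaming the archive is one option. The following is a minimal sketch, not what I actually ran: it assumes the dump really is gzip-compressed (for a .bz2 dump, `bz2.open` would be the swap) and that it follows the usual MediaWiki export layout of `<page>` elements containing a `<title>` and a `<text>`. The function names and the path passed to `peek` are made up for illustration.

```python
import gzip
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace: '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, wikitext) for each <page> in a MediaWiki XML export stream."""
    title, text = None, ""
    # iterparse reads the stream incrementally instead of loading the whole file.
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, ""
            elem.clear()  # drop the finished page's subtree


def peek(path, limit=5):
    """Print the first few page titles from a gzipped dump (path is hypothetical)."""
    with gzip.open(path, "rb") as stream:
        for i, (title, text) in enumerate(iter_pages(stream)):
            if i >= limit:
                break
            print(title, "-", len(text), "characters of wikitext")
```

Calling `peek("enwiki-latest-pages-articles.xml.gz")` (adjust the filename to whatever you downloaded) would print a handful of titles and stop, so it returns quickly even on a multi-gigabyte file.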
Luckily, people have already written importers for it. Elasticsearch is fast, reliable, and good at searching, so https://github.com/andrewvc/wikiparse was a lifesaver.
- Installed Elasticsearch
- Ran the import command
It took almost 48 hours on an i5 with 8 GB of RAM. My mistake was using the same hard disk for both the source data and the database. Your time may vary.
The data was imported, but it was still of no use! Why? It's stored in wikitext format, so a parser is needed.
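To see the problem for yourself, you can pull a document back out of Elasticsearch and look at the stored markup. This is only a sketch under assumptions: the index name `en-wikipedia` and the `title` field are my guesses at what the import created, and the URL assumes Elasticsearch on its default local port, so check your own index before relying on any of them.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: default local Elasticsearch port
INDEX = "en-wikipedia"            # assumption: use the index your import created


def build_title_search(title, size=1):
    """Build a POST request for the _search endpoint with a match query on the title."""
    body = {"query": {"match": {"title": title}}, "size": size}
    return urllib.request.Request(
        f"{ES_URL}/{INDEX}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search_title(title):
    """Run the search and return the stored source of each hit."""
    with urllib.request.urlopen(build_title_search(title)) as resp:
        hits = json.load(resp)["hits"]["hits"]
    return [hit["_source"] for hit in hits]
```

Whatever comes back from `search_title("Elasticsearch")` will still be full of `[[links]]` and `{{templates}}`, which is exactly why the next step is needed.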
After some searching, the only solution I found was the MediaWiki API, which is written in PHP. Lots of things were missing, since it is built for MediaWiki itself rather than for parsing plain text. (Though I didn't spend much time learning the internal API.)
I quickly downloaded MediaWiki, ran nginx with PHP, installed it, and used api.php.
It was good to see my own offline API, but many things were still missing or confusing, and the API's structure is hard to modify. So I created a parse.php:
<?php
// Allow cross-origin requests so a page served from elsewhere can call this script.
header('Access-Control-Allow-Origin: *');
header('Access-Control-Allow-Methods: GET, POST');
header('Access-Control-Allow-Headers: Origin, X-Requested-With, Content-Type, Accept, Key');

// Handler for the DoEditSectionLink hook. The first parameter must not be
// named $this, which PHP 7.1+ rejects as a parameter name.
function fixLink($skin, $nt, $section, $tooltip, &$result) {
    $linkName = "editSection-$section"; // the span around the link
    $anchorName = "editSectionAnchor-$section"; // the actual link anchor, generated in Linker::makeHeadline
    $chromeName = "editSectionChrome-$section"; // the chrome that surrounds the anchor
    $result = preg_replace(
        '/(\D+)( title=)(\D+)/',
        '${1} class=\'editSectionLinkInactive\' id=\'' . $anchorName . '\''
            . ' onmouseover="editSectionHighlightOn(' . $section . ')"'
            . ' onmouseout="editSectionHighlightOff(' . $section . ')" title=$3',
        $result
    );
    $result = preg_replace(
        '/<\/span>/',
        '<span class="editSectionChromeInactive" id=\'' . $chromeName . '\'>⇲</span></span>',
        $result
    );
    // while resourceloader loads this extension's css pretty late, it's still
    // overriden by skins/common/shared.css. to get around that, insert an explicit style here.
    // i'd welcome a better way to do this.
    // Net effect: the edit-section link is emptied out of the parsed HTML.
    $result = '';
    return true;
}

require(dirname(__FILE__) . '/includes/WebStart.php');
$wgHooks['DoEditSectionLink'][] = 'fixLink';
//include("api.php");

// Parse the posted wikitext and return the rendered HTML as JSON.
$text = isset($_POST['text']) ? $_POST['text'] : '';
$output = $wgParser->parse($text, Title::newFromText('Some page title'), new ParserOptions());
echo json_encode(array('html' => $output->getText()));
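For completeness, here is roughly how a client could talk to that script. It is a sketch based only on what parse.php does above: it reads a `text` form field from a POST and answers with a JSON object holding an `html` key. The URL is an assumption (wherever you placed parse.php inside your MediaWiki install), and the function names are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

PARSE_URL = "http://localhost/parse.php"  # assumption: adjust to your nginx setup


def build_parse_request(wikitext, url=PARSE_URL):
    """Build a form-encoded POST carrying the wikitext in the 'text' field."""
    data = urllib.parse.urlencode({"text": wikitext}).encode("utf-8")
    return urllib.request.Request(url, data=data, method="POST")


def html_from_response(raw):
    """Extract the rendered HTML from the JSON body returned by parse.php."""
    return json.loads(raw)["html"]


def render(wikitext, url=PARSE_URL):
    """Send wikitext to parse.php and return the HTML it produces."""
    with urllib.request.urlopen(build_parse_request(wikitext, url)) as resp:
        return html_from_response(resp.read())
```

With the server running, `render("'''Hello''' [[World]]")` should hand back an HTML fragment; the exact markup depends on your MediaWiki version and configuration.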
So, all the steps were:
- Install Elasticsearch – https://www.elastic.co/
- Download Wikipedia – https://en.wikipedia.org/wiki/Wikipedia:Database_download, https://dumps.wikimedia.org/enwiki/latest/
- Import it using an existing tool – https://github.com/andrewvc/wikiparse
- Parse it using the MediaWiki API – https://github.com/wikimedia/mediawiki