{"id":114,"date":"2016-08-18T22:02:44","date_gmt":"2016-08-18T16:32:44","guid":{"rendered":"http:\/\/madhurendra.com\/?p=114"},"modified":"2016-08-18T22:02:44","modified_gmt":"2016-08-18T16:32:44","slug":"offline-wikipedia-elasticsearch-mediawiki","status":"publish","type":"post","link":"https:\/\/madhurendra.com\/offline-wikipedia-elasticsearch-mediawiki\/","title":{"rendered":"Offline Wikipedia with Elasticsearch and MediaWiki"},"content":{"rendered":"
Wikipedia is awesome! It’s open, it’s free – yea – and it’s huge: millions of articles. But as a developer, how do you exploit all that free knowledge?<\/p>\n I started digging around the internet for ways to use my fresh 9+ GB gzipped XML dump, which at first seemed of no use, since even a simple text editor can’t open it. (I was just excited to see what’s inside, how it’s structured – the schema!)<\/p>\n Luckily, people have already imported it. Elasticsearch is fast and reliable, and it’s good at searching, so https:\/\/github.com\/andrewvc\/wikiparse was a saver.<\/p>\n The import took almost 48 hours on an i5 with 8 GB of RAM – my mistake was using the same hard disk for the dump and the Elasticsearch data, so your time may vary.<\/p>\n The data was imported, but it was still of no use. Why? The articles are stored as wikitext, so a parser is needed.<\/p>\n The only solution my searching turned up was the MediaWiki API, which is written in PHP. A lot was missing for my purpose, since it is built for a full MediaWiki installation rather than for parsing plain wikitext. (Though I didn’t spend much time learning the internal API.)<\/p>\n I quickly downloaded MediaWiki, ran nginx with PHP, installed it, and used api.php. So all the steps were:<\/p>\n<ol>\n<li>Download the Wikipedia XML dump.<\/li>\n<li>Import it into Elasticsearch with wikiparse.<\/li>\n<li>Install MediaWiki behind nginx with PHP.<\/li>\n<li>Render the stored wikitext to HTML through api.php.<\/li>\n<\/ol>\n
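Once the dump is imported, articles can be pulled straight from Elasticsearch over its HTTP search API. A minimal Python sketch of a title search – the index name "en-wikipedia" and the "title" field are my assumptions here, so check the mapping your wikiparse import actually created:

```python
import json
import urllib.request

# Query body for a simple full-text title search.
# NOTE: the index name "en-wikipedia" and the "title" field are assumptions;
# inspect the mapping created by your wikiparse import before using them.
query = {
    "size": 1,
    "query": {"match": {"title": "Albert Einstein"}},
}

def search(host="http://localhost:9200", index="en-wikipedia"):
    """POST the query to a local Elasticsearch and return the raw hits."""
    req = urllib.request.Request(
        "%s/%s/_search" % (host, index),
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["hits"]["hits"]

# search() needs a running Elasticsearch; here we only show the query body.
print(json.dumps(query, indent=2))
```

The `match` query gives relevance-ranked results, which is the main win over grepping the raw dump.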
\nIt was good to see my own offline API too, but many things were still missing or confusing, and the api.php response structure is hard to modify. So I created a parse.php:<\/p>\n<?php\r\nheader('Access-Control-Allow-Origin: *');\r\nheader('Access-Control-Allow-Methods: GET, POST');\r\nheader('Access-Control-Allow-Headers: Origin, X-Requested-With, Content-Type, Accept, Key');\r\nheader('Content-Type: application\/json');\r\n\r\n\/\/ Handler for MediaWiki's DoEditSectionLink hook. Note that $this is not a\r\n\/\/ valid parameter name in PHP, so the skin object is received as $skin.\r\nfunction fixLink($skin, $nt, $section, $tooltip, &$result) {\r\n\t\t$linkName = \"editSection-$section\"; \/\/ the span around the link\r\n\t\t$anchorName = \"editSectionAnchor-$section\"; \/\/ the actual link anchor, generated in Linker::makeHeadline\r\n\t\t$chromeName = \"editSectionChrome-$section\"; \/\/ the chrome that surrounds the anchor\r\n\t\t$result = preg_replace('\/(\\D+)( title=)(\\D+)\/', '${1} class=\\'editSectionLinkInactive\\' id=\\'' . $anchorName . '\\' onmouseover=\"editSectionHighlightOn(' . $section . ')\" onmouseout=\"editSectionHighlightOff(' . $section . ')\" title=$3', $result);\r\n\t\t$result = preg_replace('\/<\\\/span>\/', '<span class=\"editSectionChromeInactive\" id=\\'' . $chromeName . '\\'>⇲<\/span><\/span>', $result);\r\n\t\t\/\/ While ResourceLoader loads this extension's CSS pretty late, it is still\r\n\t\t\/\/ overridden by skins\/common\/shared.css; to get around that, insert an explicit style here.\r\n\t\t\/\/ I'd welcome a better way to do this.\r\n\t\treturn true;\r\n}\r\n\r\nrequire(dirname(__FILE__) . '\/includes\/WebStart.php');\r\n\r\n$wgHooks['DoEditSectionLink'][] = 'fixLink';\r\n\r\n$output = $wgParser->parse(\r\n    isset($_POST['text']) ? $_POST['text'] : '',\r\n    Title::newFromText('Some page title'),\r\n    new ParserOptions());\r\n\r\necho json_encode(array('html' => $output->getText()));\r\n<\/pre>\n
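With parse.php served by nginx, the whole offline pipeline is just two HTTP calls: fetch wikitext from Elasticsearch, then POST it to parse.php and read the `html` field of the JSON reply. A sketch of that second call – the URL is an assumption for a local MediaWiki install, adjust it to your setup:

```python
import json
import urllib.parse
import urllib.request

# Wikitext to render; parse.php reads it from $_POST['text'],
# so it must be sent form-encoded under the "text" key.
wikitext = "'''Offline''' [[Wikipedia]] rocks."
body = urllib.parse.urlencode({"text": wikitext}).encode("utf-8")

def render(url="http://localhost/mediawiki/parse.php"):
    """POST the wikitext and return the 'html' field of the JSON reply.
    The URL is an assumed local install path, not a fixed endpoint."""
    with urllib.request.urlopen(url, data=body) as resp:
        return json.loads(resp.read().decode("utf-8"))["html"]

# render() needs the nginx + PHP setup from the article;
# here we only show the form-encoded request body.
print(body.decode("utf-8"))
```

Form-encoding matters: sending raw wikitext or JSON would leave `$_POST['text']` empty and parse.php would render an empty page.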
\n