{"id":114,"date":"2016-08-18T22:02:44","date_gmt":"2016-08-18T16:32:44","guid":{"rendered":"http:\/\/madhurendra.com\/?p=114"},"modified":"2016-08-18T22:02:44","modified_gmt":"2016-08-18T16:32:44","slug":"offline-wikipedia-elasticsearch-mediawiki","status":"publish","type":"post","link":"https:\/\/madhurendra.com\/offline-wikipedia-elasticsearch-mediawiki\/","title":{"rendered":"Offline Wikipedia with Elasticsearch and MediaWiki"},"content":{"rendered":"
Wikipedia is awesome! It’s open, it’s free – yea – and it’s huge: millions of articles. But as a developer, how do you exploit all that free knowledge?<\/p>\n I started digging around the internet for ways to use my fresh 9+ GB gzipped XML dump, which at first seemed of no use, since even a simple text editor can’t open it. (I was just excited to see what’s inside, how it’s structured – the schema!)<\/p>\n Luckily, people have already imported it. Elasticsearch is fast and reliable, and it’s good at searching, so https:\/\/github.com\/andrewvc\/wikiparse was a saver.<\/p>\n The import took almost 48 hours on an i5 with 8 GB of RAM – my mistake was using the same hard disk for the dump and the Elasticsearch data, so your time may vary.<\/p>\n The data was imported, but it was still of no use. Why? The articles are stored as wikitext, so a parser is needed.<\/p>\n The only solution my searching turned up was the MediaWiki API, which is written in PHP. A lot was missing for my purpose, since it is built for a full MediaWiki installation rather than for parsing plain wikitext. (Though I didn’t spend much time learning the internal API.)<\/p>\n I quickly downloaded MediaWiki, ran nginx with PHP, installed it, and used api.php. So all the steps were:<\/p>\n<ol>\n<li>Download the Wikipedia XML dump.<\/li>\n<li>Import it into Elasticsearch with wikiparse.<\/li>\n<li>Install MediaWiki behind nginx with PHP.<\/li>\n<li>Render the stored wikitext to HTML through api.php.<\/li>\n<\/ol>\n
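Once the dump is imported, articles can be pulled straight from Elasticsearch over its HTTP search API. A minimal Python sketch of a title search – the index name "en-wikipedia" and the "title" field are my assumptions here, so check the mapping your wikiparse import actually created:

```python
import json
import urllib.request

# Query body for a simple full-text title search.
# NOTE: the index name "en-wikipedia" and the "title" field are assumptions;
# inspect the mapping created by your wikiparse import before using them.
query = {
    "size": 1,
    "query": {"match": {"title": "Albert Einstein"}},
}

def search(host="http://localhost:9200", index="en-wikipedia"):
    """POST the query to a local Elasticsearch and return the raw hits."""
    req = urllib.request.Request(
        "%s/%s/_search" % (host, index),
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["hits"]["hits"]

# search() needs a running Elasticsearch; here we only show the query body.
print(json.dumps(query, indent=2))
```

The `match` query gives relevance-ranked results, which is the main win over grepping the raw dump.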
\nIt was good to see my own offline API too, but many things were still missing or confusing, and the api.php response structure is hard to modify. So I created a parse.php:<\/p>\n<?php\r\nheader('Access-Control-Allow-Origin: *');\r\nheader('Access-Control-Allow-Methods: GET, POST');\r\nheader('Access-Control-Allow-Headers: Origin, X-Requested-With, Content-Type, Accept, Key');\r\nheader('Content-Type: application\/json');\r\n\r\n\/\/ Handler for MediaWiki's DoEditSectionLink hook. Note that $this is not a\r\n\/\/ valid parameter name in PHP, so the skin object is received as $skin.\r\nfunction fixLink($skin, $nt, $section, $tooltip, &$result) {\r\n\t\t$linkName = \"editSection-$section\"; \/\/ the span around the link\r\n\t\t$anchorName = \"editSectionAnchor-$section\"; \/\/ the actual link anchor, generated in Linker::makeHeadline\r\n\t\t$chromeName = \"editSectionChrome-$section\"; \/\/ the chrome that surrounds the anchor\r\n\t\t$result = preg_replace('\/(\\D+)( title=)(\\D+)\/', '${1} class=\\'editSectionLinkInactive\\' id=\\'' . $anchorName . '\\' onmouseover=\"editSectionHighlightOn(' . $section . ')\" onmouseout=\"editSectionHighlightOff(' . $section . ')\" title=$3', $result);\r\n\t\t$result = preg_replace('\/<\\\/span>\/', '<span class=\"editSectionChromeInactive\" id=\\'' . $chromeName . '\\'>⇲<\/span><\/span>', $result);\r\n\t\t\/\/ While ResourceLoader loads this extension's CSS pretty late, it is still\r\n\t\t\/\/ overridden by skins\/common\/shared.css; to get around that, insert an explicit style here.\r\n\t\t\/\/ I'd welcome a better way to do this.\r\n\t\treturn true;\r\n}\r\n\r\nrequire(dirname(__FILE__) . '\/includes\/WebStart.php');\r\n\r\n$wgHooks['DoEditSectionLink'][] = 'fixLink';\r\n\r\n$output = $wgParser->parse(\r\n    isset($_POST['text']) ? $_POST['text'] : '',\r\n    Title::newFromText('Some page title'),\r\n    new ParserOptions());\r\n\r\necho json_encode(array('html' => $output->getText()));\r\n<\/pre>\n
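With parse.php served by nginx, the whole offline pipeline is just two HTTP calls: fetch wikitext from Elasticsearch, then POST it to parse.php and read the `html` field of the JSON reply. A sketch of that second call – the URL is an assumption for a local MediaWiki install, adjust it to your setup:

```python
import json
import urllib.parse
import urllib.request

# Wikitext to render; parse.php reads it from $_POST['text'],
# so it must be sent form-encoded under the "text" key.
wikitext = "'''Offline''' [[Wikipedia]] rocks."
body = urllib.parse.urlencode({"text": wikitext}).encode("utf-8")

def render(url="http://localhost/mediawiki/parse.php"):
    """POST the wikitext and return the 'html' field of the JSON reply.
    The URL is an assumed local install path, not a fixed endpoint."""
    with urllib.request.urlopen(url, data=body) as resp:
        return json.loads(resp.read().decode("utf-8"))["html"]

# render() needs the nginx + PHP setup from the article;
# here we only show the form-encoded request body.
print(body.decode("utf-8"))
```

Form-encoding matters: sending raw wikitext or JSON would leave `$_POST['text']` empty and parse.php would render an empty page.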
\n