Offline Wikipedia with Elasticsearch and MediaWiki

Wikipedia is awesome! It's open, it's free, and it's huge: millions of articles. But as a developer, how do you exploit all that free knowledge?

I started digging around the internet for ways to make use of my fresh 9+ GB gzipped XML dump, which seemed to be of no use since not even a simple text editor could open it. (I was just excited to see what's inside, how it's structured, what the schema looks like!)

Luckily, people have already done the importing work. Elasticsearch is fast, reliable and good at search, so https://github.com/andrewvc/wikiparse was a saver.

  • Installed Elasticsearch
  • Ran the wikiparse import command (a quick sanity check is sketched just below)
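
Once the import finished, a quick way to confirm the data actually landed is to hit Elasticsearch's REST API directly. This is only a rough sketch: the index name below is an assumption, so substitute whatever index wikiparse actually created on your machine (http://localhost:9200/_cat/indices will list them).

    <?php
    // Sanity-check the wikiparse import by talking to Elasticsearch's REST API.
    // NOTE: the index name is an assumption; replace it with the index wikiparse
    // actually created on your cluster.
    $index = 'en-wikipedia';

    // Total number of imported documents.
    $count = json_decode(file_get_contents("http://localhost:9200/{$index}/_count"), true);
    echo "Imported docs: " . $count['count'] . "\n";

    // A quick full-text search straight against Elasticsearch.
    $raw = file_get_contents("http://localhost:9200/{$index}/_search?size=1&q=" . urlencode('title:tesla'));
    $res = json_decode($raw, true);
    if (!empty($res['hits']['hits'])) {
        print_r($res['hits']['hits'][0]['_source']);
    } else {
        echo "No hits yet\n";
    }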

The import took almost 48 hours on an i5 with 8 GB of RAM; my mistake was keeping the dump and the Elasticsearch data on the same hard disk. Your time may vary.

The data was imported, but it was still of no use! Why? It's stored as raw wikitext, so a parser is needed.
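
To make the problem concrete, here is roughly what a document's text field looks like straight out of Elasticsearch: raw wikitext markup, not HTML. (An illustrative snippet, not an actual dump excerpt.)

    <?php
    // Illustrative wikitext, the kind of markup stored in the dump (made up, not a real excerpt).
    $wikitext = <<<'WIKI'
    '''Nikola Tesla''' was a [[Serbian Americans|Serbian-American]] [[inventor]].
    {{Infobox person
    | name = Nikola Tesla
    }}
    WIKI;
    // A browser can't render this directly; something has to turn it into HTML first.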

After searching around, the only solution I found was the MediaWiki API, which is written in PHP. A lot was still missing for my case, since it's built for a MediaWiki installation rather than for parsing plain text. (Though I didn't spend much time learning the internal API.)

I quickly downloaded MediaWiki, ran nginx with PHP, installed it and used api.php.
It was good to see my own offline API too, but many things were still missing or confusing, and the API's output structure is hard to modify. So I created a parse.php.
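
Here is a rough sketch of the idea behind parse.php (not the exact file I ended up with): pull an article's wikitext out of Elasticsearch, then POST it to the local api.php with action=parse to get rendered HTML back. The index name, the source field name and the api.php path are assumptions; adjust them to your setup.

    <?php
    // parse.php -- rough sketch: look up an article in Elasticsearch, then ask
    // the local MediaWiki api.php to render its wikitext as HTML.
    $title  = isset($_GET['title']) ? $_GET['title'] : 'Nikola Tesla';
    $index  = 'en-wikipedia';                       // assumed index created by wikiparse
    $apiUrl = 'http://localhost/mediawiki/api.php'; // wherever nginx serves MediaWiki

    // 1. Fetch the article's wikitext from Elasticsearch (URI search).
    $raw  = file_get_contents(
        "http://localhost:9200/{$index}/_search?size=1&q=" . urlencode('title:"' . $title . '"')
    );
    $hits = json_decode($raw, true);
    $doc  = $hits['hits']['hits'][0]['_source'];
    $wikitext = $doc['text'];                       // assumed field holding the raw wikitext

    // 2. POST the wikitext to MediaWiki's parse API and pull out the rendered HTML.
    $post = http_build_query([
        'action'       => 'parse',
        'format'       => 'json',
        'contentmodel' => 'wikitext',
        'prop'         => 'text',
        'text'         => $wikitext,
    ]);
    $ctx = stream_context_create(['http' => [
        'method'  => 'POST',
        'header'  => "Content-Type: application/x-www-form-urlencoded\r\n",
        'content' => $post,
    ]]);
    $resp = json_decode(file_get_contents($apiUrl, false, $ctx), true);

    header('Content-Type: text/html; charset=utf-8');
    echo $resp['parse']['text']['*'];               // rendered article HTML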

So, all the steps were:

  • Downloaded the Wikipedia XML dump
  • Installed Elasticsearch and imported the dump with wikiparse
  • Installed MediaWiki behind nginx with PHP
  • Used api.php, plus a small parse.php wrapper, to turn the stored wikitext into HTML

