{"id":45,"date":"2015-11-18T01:23:44","date_gmt":"2015-11-17T19:53:44","guid":{"rendered":"http:\/\/madhurendra.com\/?p=45"},"modified":"2016-11-02T08:16:12","modified_gmt":"2016-11-02T02:46:12","slug":"importing-large-csv-file-to-mongo","status":"publish","type":"post","link":"https:\/\/madhurendra.com\/importing-large-csv-file-to-mongo\/","title":{"rendered":"Importing large CSV file to Mongo."},"content":{"rendered":"

Someday you will need to import a large file into Mongo or some other DBMS.
You start writing Node.js code to do exactly that, but when you run it, memory usage keeps climbing and one core of your CPU is pinned at 100%.


var fs = require('fs'),
    readline = require('readline'),
    instream = fs.createReadStream("SomeFile"),
    // The original snippet used rl without ever creating it; build the interface here.
    rl = readline.createInterface({ input: instream, terminal: false }),
    k,
    buffer = [],   // batch of parsed documents waiting to be inserted
    c = 0,         // number of documents currently in the buffer
    sCount = 0,    // documents saved so far
    lines = 0;     // total lines read

console.log("Importing...");

rl.on('line', function (line) {
    ++lines;
    k = line.split(",");
    buffer.push({ p: k[0], q: k[1] });
    c++;

    if (c == 100000) {
        // Pause the reader. This won't fully help: by the time the stream
        // actually pauses, a few more 'line' events have already fired.
        rl.pause();
        insertDocument(buffer, function () {
            sCount += buffer.length;
            buffer = []; // clear the buffer for the next batch
            c = 0;
            rl.resume();
        });
    }
});

rl.on('close', function () {
    // Flush whatever is left over in the last partial batch.
    insertDocument(buffer, function () {
        console.log("\nProcessed [ Lines: " + lines + " ]");
    });
});
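The snippet calls insertDocument, which the post never defines. A minimal sketch of what it might look like, assuming the callback-style 2.x mongodb Node driver and a hypothetical records collection (in real code you would wait for the connection to open before starting to read the file):

var MongoClient = require('mongodb').MongoClient;

var db; // set once the connection opens

MongoClient.connect('mongodb://localhost:27017/test', function (err, database) {
    if (err) throw err;
    db = database;
});

// Hypothetical helper: bulk-insert one batch, then invoke the callback.
function insertDocument(docs, done) {
    if (docs.length === 0) return done();
    db.collection('records').insertMany(docs, function (err) {
        if (err) throw err;
        done();
    });
}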
  1. Since Node.js is asynchronous, `line` events keep arriving without waiting for the previous handler to finish; you might get 10,000 lines per second, depending on disk activity.
  2. We created a buffer so that fewer calls are made to the database, but until the database finishes processing a request, Node may push the same buffer 100 times for processing.
  3. To solve this you might think of pausing the reader until the records are processed. Cool! But it won't work as expected, because readline doesn't pause immediately; some events are still fired after pause(), so you might miss those lines (see https://nodejs.org/api/readline.html#readline_rl_pause).
  4. You might think of adding a delay of a few ms, but that isn't a scalable solution. A fairly easy fix is to create another buffer and use a bit to mark the start of the insert operation: events fired after that point go into the new buffer, and when the database finishes, flip the bit and swap the buffers (see the sketch after this list).
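A minimal sketch of that double-buffer idea, reusing rl and the hypothetical insertDocument from above (the 100000 threshold is the same illustrative batch size):

var active = [],       // buffer currently receiving 'line' events
    inserting = false; // the "bit": true while a batch is being written

rl.on('line', function (line) {
    var k = line.split(",");
    active.push({ p: k[0], q: k[1] });

    // When the batch is full and no insert is running, swap buffers:
    // the full one goes to the database, while a fresh one keeps collecting
    // the events that readline fires even after pause().
    if (active.length >= 100000 && !inserting) {
        inserting = true;
        var batch = active;
        active = [];
        insertDocument(batch, function () {
            inserting = false; // flip the bit back; the next full buffer can go
        });
    }
});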

    A better solution is to skip the custom script entirely: Mongo (mongoimport) and MySQL (mysqlimport) have built-in tools to import large CSV files.

    mongoimport --db DATABASE --collection COLLECTION_NAME --type csv --headerline --file SOME_LARGE.csv

    It works efficiently, with a nice CLI showing progress.
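    For MySQL, the counterpart would be mysqlimport, a CLI wrapper around LOAD DATA INFILE; the database and file names here are placeholders:

    mysqlimport --local --fields-terminated-by=',' --ignore-lines=1 DATABASE SOME_LARGE.csv

    Note that mysqlimport derives the table name from the file name, so SOME_LARGE.csv loads into a table named SOME_LARGE.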
