Re: Smart strategy to write millions of files on OS X with bash curl?

From: Jim Young <j4young_at_gmail.com>
Date: Mon, 05 Jan 2015 18:02:34 -0600
On 1/5/2015 10:15 AM, Rodrigo Zanatta Silva wrote:
Hi. I am running curl from bash at as much capacity as possible.

So I create about 150 (or more, if possible) bash scripts, each with a list of curl commands, and run them all at the same time, so I have 150 workers going. It works. Sometimes it fails to download or write a file, but I can run another script that checks whether each file exists and downloads it again.
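
(For illustration, a minimal sketch of one way to get the same effect without hand-splitting the work into 150 scripts; urls.txt is an assumed file with one URL per line, and 150 is just the worker count mentioned above:)

    # Run up to 150 curl processes at a time, one URL each;
    # -O saves every page under its remote file name.
    # BSD xargs on OS X supports both -P (parallel jobs) and -n (args per call).
    xargs -n 1 -P 150 curl -sS -O < urls.txt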

But there are times when I only download small HTML files (about 1 KB each), just a really big number of them.

My problem is that this can really make a mess of my system. I DAMAGED the partition of an HFS hard drive while working on my old MacBook with OS X Lion (10.7.5). I had to format the drive because the Mac repair tool could not fix it, but it does not seem to be a hardware problem, since the drive is working fine now.

I thought a newer OS X might handle this better, so I am now using my main computer with OS X Yosemite (10.10.1).

After I wrote about 1 million files, the Finder became really slow across the ENTIRE system (not only in the folder with the files). I disabled indexing for that folder.

Now I don't know what strategy to use. Here are some ideas I have been thinking about but don't know how to implement:
  • Write all results to one file (some time ago I tried to make bash write to one file from several threads and it failed miserably; I would need a more complex strategy with file locking to do it).
  • Write all the output of each thread to its own file (so it would create 150 files).
    • With this strategy, how can I write "<filename>content <otherfilename>content..."? (See the sketch below.)
  • Write every file to disk, but use some tool so that it does not affect the system.
  • Buffer in memory and write to disk from time to time.
    • Is there an easy way to do this?
Any ideas? Maybe the network is the slowest part of the system, so even if I lose some time writing, the cost is not so big after all.
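
For the second idea (one output file per worker), a rough sketch of what each worker's loop could look like; list_01.txt and out_01.txt are assumed, hypothetical names:

    # Append every page to this worker's single output file, each one
    # preceded by a marker line carrying the URL it came from.
    while read -r url; do
      printf '===%s===\n' "$url"    # the "<filename>" part of the record
      curl -sS "$url"               # the page body follows the marker
      printf '\n'
    done < list_01.txt > out_01.txt
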
I would approach the problem by downloading them as individual files using the -O option (as you presumably are doing currently), with the additional step of having curl print the file names to stdout using the -w option. I'd then pipe stdout to a separate process (like zip or another shell script) which would use the list of file names to consolidate the individual files into a single large file.
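
As a rough sketch of that pipeline (urls.txt and pages.zip are assumed names, and %{filename_effective} needs a curl new enough to support it):

    # -O saves each response under its remote file name,
    # -w prints that name on stdout after each transfer finishes,
    # and zip -@ reads the names from stdin and adds those files to one archive.
    while read -r url; do
      curl -sS -O -w '%{filename_effective}\n' "$url"
    done < urls.txt | zip pages.zip -@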

-------------------------------------------------------------------
List admin: http://cool.haxx.se/list/listinfo/curl-users
FAQ: http://curl.haxx.se/docs/faq.html
Etiquette: http://curl.haxx.se/mail/etiquette.html
Received on 2015-01-06