curl-users

Re: Smart strategy to write million of files in OSX with bash curl?

From: Dan Fandrich <dan_at_coneharvesters.com>
Date: Tue, 6 Jan 2015 10:22:51 +0100

On Mon, Jan 05, 2015 at 02:15:22PM -0200, Rodrigo Zanatta Silva wrote:
> Hi. I am using curl from bash and pushing it to the maximum capacity possible.
>
> So, I create about 150 (or more if possible) bash scripts, each with a list of
> curl commands, and open all of them at the same time (so I have 150 threads
> working). It works. Maybe it fails to download or write some files, but I can
> run another script to check whether each file exists and download it again.
>
> But... there are times when I will only download small HTML files (1k), but a
> really big number of them.
>
> My problem is: this can really make a mess of my system. I DAMAGED the
> partition of an HFS HD when I was working on my old MacBook with OS X Lion
> (10.7.5). (I needed to format the HD because the Mac program couldn't fix it,
> but it doesn't look like a hardware problem because the HD is working now.)

Filesystem damage indicates a kernel bug or hardware failure. It's too bad
pushing it so hard causes this. You may want to contact Apple about this,
especially if you have a reliable test case.

> I thought that a newer OS X might be better, so I am using my main computer
> with OS X Yosemite (10.10.1).
>
> After I wrote about 1 million files, the Finder was really slow across the
> WHOLE system (not only in the folder with the files). I disabled indexing for
> this folder.

Are you writing all these millions of files into a single directory? Many
filesystems don't handle that case well and devolve into pathologically slow
behaviour. The solution is to use another filesystem, write the files into a
database instead, or shard the directory. The last option may be the easiest:
it involves creating a hierarchy of directories, one or more levels deep, into
which the files are stored. This is how git stores its files, for example.
Rather than having one huge .git/objects directory into which all the objects
are placed, there's a second layer of up to 256 directories, each named after
the first two hexadecimal digits of the object names it holds. This reduces
the size of a single directory by a factor of about 256. That's probably not
enough for your case as it would still leave tens of thousands of files in a
single directory, so you'd likely have to create a second level of directories
inside the first.
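
As a rough illustration of the two-level sharding described above, here is a
minimal Python sketch. The helper names (sharded_path, store), the choice of
SHA-1 and the "out" directory are assumptions made for this example; they are
not part of curl or git. Unlike git, whose object names are already hashes,
downloaded file names are arbitrary, so the sketch hashes the name first to
spread the files evenly across the directories.

```python
import hashlib
from pathlib import Path


def sharded_path(root, name, levels=2):
    """Return root/ab/cd/name, where "ab" and "cd" are the first hex digit
    pairs of a hash of the file name (hypothetical helper)."""
    digest = hashlib.sha1(name.encode("utf-8")).hexdigest()
    # Each level uses two hex digits, so it has at most 256 subdirectories.
    parts = [digest[2 * i : 2 * i + 2] for i in range(levels)]
    return Path(root).joinpath(*parts, name)


def store(root, name, data):
    """Write data (bytes) to its sharded location, creating directories
    on demand, and return the path that was written."""
    path = sharded_path(root, name)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
    return path
```

With two levels there are up to 256 * 256 = 65,536 leaf directories, so one
million files works out to roughly 15 files per directory on average. Because
the path depends only on the file name, a later "does this file exist yet?"
check can recompute the same location with sharded_path("out", name).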

>>> Dan
Received on 2015-01-06