Re: Smart strategy to write million of files in OSX with bash curl?

From: Rodrigo Zanatta Silva <rodrigozanattasilva_at_gmail.com>
Date: Fri, 16 Jan 2015 22:00:33 -0200

[...]

> Filesystem damage indicates a kernel bug or hardware failure. It's too
> bad pushing it so hard causes this. You may want to contact Apple about
> this, especially if you have a reliable test case.

Lol... I will probably never know why this happened, but I got two kernel
panics and had never seen one before. For what it's worth, after connecting
it as a secondary drive I could still read the files, but no program could
fix the logical damage, so I just formatted everything.

> Are you writing all these millions of files into a single directory?
> Many filesystems don't handle that case well and devolve into
> pathologically slow behaviour. The solution to this is to either use
> another filesystem, write the files into a database instead, or shard
> the directory. The latter solution may be the easiest, and involves
> creating a hierarchy of directories into which the files are stored
> which can be one or more levels deep. This is how git stores its files,
> for example. Rather than having one huge .git/objects directory into
> which all the objects are placed, there's a second layer of 256
> directories containing the first two hexadecimal digits of the object
> names. This reduces the size of a single directory by a factor of 256.
> That's probably not enough for your case as it would still leave tens
> of thousands of files in a single directory, so you'd likely have to
> create a second level of directories inside the first.

>>> Dan

I REALLY loved your idea. Why didn't I think of it? By the way, do you
know the "best" number of files to keep in one directory? Like, is it
better to stay under 1,000, or to use a power of two minus one, like 511
or 1023 files per directory? And should I use 256 subdirectories at every
level? A sketch of what I have in mind is below.
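For illustration, a minimal sketch of the two-level layout Dan describes,
assuming each file is named by the SHA-1 hash of its URL (the hashing
scheme and the pages/ root directory are my assumptions, not something
from the thread):

  #!/usr/bin/env bash
  # save_page.sh URL: store one page under pages/xx/yy/<sha1-of-url>,
  # where xx and yy are the first two hex-digit pairs of the hash.
  # Two levels of 256 directories keep every directory small even with
  # millions of files.
  url="$1"
  hash=$(printf '%s' "$url" | shasum -a 1 | cut -c1-40)
  dir="pages/${hash:0:2}/${hash:2:2}"
  mkdir -p "$dir"
  curl -s -o "$dir/$hash" "$url"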

About writing to a database: yeah... I can use SQLite, and it would solve
all the filesystem problems.
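For example, a minimal sketch using the sqlite3 command-line tool (the
pages table and its schema are my assumption; readfile() is built into
recent versions of the sqlite3 shell):

  # Store the downloaded page as a BLOB in a single database file.
  # Assumes $url contains no single quotes; a real program should use
  # bound parameters through a proper SQLite binding instead.
  sqlite3 pages.db "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, body BLOB);"
  curl -s -o /tmp/page.$$ "$url"
  sqlite3 pages.db "INSERT OR REPLACE INTO pages VALUES ('$url', readfile('/tmp/page.$$'));"
  rm -f /tmp/page.$$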

Hmm... what do you think is the best strategy across common operating
systems? (I will create an open source program and want it to work on
Linux/Windows/Mac.) First of all, the easiest way is to write bash
scripts and run them all in the background. So:

   - Use a tree of subdirectories.
   - Use SQLite. This is a little more complex, but it is exactly the
   answer to the "one file" problem.
   - In that case, should I make 150 bash scripts write to the same
   database file (so bash and the operating system solve the locking
   problem, not me), or create one database per thread and join the
   databases when everything finishes?
   - And, when working with a database, is the best way to DOWNLOAD a
   page with curl to save the file to disk and then write it into the
   database (deleting the file afterwards), or to have curl download
   straight into memory and write that into the database? Remember,
   150 threads are doing this at the same time. (A sketch covering
   these last two questions follows below.)
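One thing worth knowing: SQLite allows only one writer at a time, so 150
scripts writing to the same database would mostly sit waiting on each
other's locks, which makes the one-database-per-worker idea probably
safer. A minimal sketch, assuming worker databases named worker-N.db that
share the pages schema above (all names here are my assumptions):

  # worker.sh N URLFILE: worker N downloads its own list of URLs into
  # its own private database, so writers never fight over one lock.
  db="worker-$1.db"
  sqlite3 "$db" "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, body BLOB);"
  while read -r url; do
    tmp=$(mktemp)
    curl -s -o "$tmp" "$url"
    sqlite3 "$db" "INSERT OR REPLACE INTO pages VALUES ('$url', readfile('$tmp'));"
    rm -f "$tmp"
  done < "$2"

  # merge.sh: afterwards, join every worker database into one file.
  sqlite3 pages.db "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, body BLOB);"
  for db in worker-*.db; do
    sqlite3 pages.db "ATTACH '$db' AS w;
                      INSERT OR IGNORE INTO pages SELECT * FROM w.pages;
                      DETACH w;"
  done

As for disk versus memory: from a plain bash script there is no clean way
to hand curl's output to SQLite as a BLOB without touching the disk, so
the sketch goes through a temp file; skipping the file entirely would
need a language with a real SQLite binding.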

PS. Really sorry. I thought that if someone answered my thread I would
see the e-mail on its own rather than in the digest. I have now configured
the system better. That is why I did not see the replies for a while.
