curl / Mailing Lists / curl-users / Single Mail


Re: Further remarks on parallel "progress"

From: Timothe Litt <>
Date: Mon, 29 Apr 2019 07:30:51 -0400

On 29-Apr-19 03:21, wrote:
> Date: Sun, 28 Apr 2019 23:14:29 +0200 (CEST)
> From: Daniel Stenberg <>
> To: the curl tool <>
> Subject: Re: curl-users Digest, Vol 164, Issue 12
> On Sat, 27 Apr 2019, Timothe Litt wrote:
> Thanks for your reply!
>> I think I want to know about the outliers/problems, not so much the
>> successful transfers.
>> How about at least #failed transfers?  That would fit on a 1-line status.
> But what are you going to use that info for? So in my example case, there
> could be a few transfers that returned a failure but since the other transfers
> aren't done yet you probably wouldn't stop the transfers anyway. Or am I
> wrong?

It depends.  If the number is large and the transfers are from a single
host, I expect I'd kill the transfers and look for another source, a
network issue, or a different time.

If the number is relatively small, I expect I'd let the transfers
continue -- and in parallel, open another terminal window to deal with
the outliers.  Humans can work in parallel too... The challenge here is
knowing which transfers failed.  Perhaps a --exceptions or --log file
would help in that case.

I don't expect parallel transfers to be very interesting for small
numbers of files.  For large numbers, either it's a background process or
someone is watching it.  In the former case, progress meters won't be
used.  In the latter, it's better to take action than to wait for the end.

The other question is where the list of files necessary to exploit
parallelism comes from.

There is curl's [] globbing notation - if the files are known in advance.
For periodic data collection, the list of files can be in a local file -
it almost has to be, because shells have a finite command-line length.

With FTP, a directory listing can be used to implement wildcards; many
FTP clients have done that.  With HTTP, there are fewer standard formats
- Apache httpd's default index listing is one.
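To make the directory-listing idea concrete, here is a minimal sketch (not part of curl, and the page markup is invented for illustration) that pulls file links out of an Apache-style index page to build a URL list:

```python
# Hypothetical sketch: collect file links from an Apache-style index page
# so they can be fed to a downloader as a URL list.
from html.parser import HTMLParser


class IndexParser(HTMLParser):
    """Collect href targets, skipping navigation links like '../' and '?C=N'."""

    def __init__(self):
        super().__init__()
        self.files = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href", "")
        if href and not href.startswith(("?", "/", "../")):
            self.files.append(href)


def list_files(index_html, base_url):
    p = IndexParser()
    p.feed(index_html)
    return [base_url + f for f in p.files]


# A fabricated index page of the kind Apache generates by default.
page = ('<html><body><a href="?C=N;O=D">Name</a>'
        '<a href="../">Parent Directory</a>'
        '<a href="access.log.1.gz">access.log.1.gz</a>'
        '<a href="error.log">error.log</a></body></html>')

print(list_files(page, "http://example.com/logs/"))
```

The filtering of `?`-sorted and parent links is what makes this usable as a crude wildcard; real pages vary, so treat it as a starting point only.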

But in most cases, large numbers of files will be in a (compressed)
archive of some sort.  (tar, zip, ...)

So the use case would seem to be where one is pulling data from multiple
hosts - e.g. daily log file collection, or sensor reports, or ...  In
that case, it won't be uncommon for some to be down, or have routing
issues, or... Plus, as with torrents, sourcing from multiple hosts is
likely to be the best way to get the inbound bandwidth necessary for
parallel transfers to be effective.  There's no point in 500 parallel
transfers from one host that has a 56K link[1].  But if I need data from
500 hosts, and I have a Gbit link...

In the case of multiple hosts, it's pretty clear which group of files
fails - for a routine transfer, it's likely to be all files from that
host.  So working to fix that while letting the other 499 run is worthwhile.

If it's only a few files from one host - perhaps you catch a log
directory while logrotate is running - again, you open another window to
pull the exception files.  (E.g. logrotate has zipped or renamed a file.)

All these cases point to:

A summary that lets a user know the extent of issues.

Availability (e.g. via a log or exception file) of a list of exceptions
while the transfer is running.

Most benefit from cases of MANY files, probably on many hosts.

[1] OK, 56K is unlikely.  But "slow" is quite possible.

> Perhaps we should output to the screen if an individual transfer fails, like
> "transfer of $URL returned XX: bla bla bla".

Yes, but better to put it into an exceptions or log file - that allows
a process to deal with exceptions after the message rolls off the
screen.  The process might be a human, or a program.  So use a format
that is consistent and easy to parse, perhaps:

STATUS: URI, size, time, status_code, error_message

Then it would be easy to grep for ^FAILED, ^PENDING, ^COMPLETE, etc., or
split on /,/ and look for 404, 500, etc...
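As a sketch of how that would look in practice (the field set is this thread's suggestion, not an existing curl feature; names and URLs are invented):

```python
# One parseable line per transfer, as proposed above:
# STATUS: URI, size, time, status_code, error_message
def status_line(status, uri, size, secs, code, msg):
    return f"{status}: {uri}, {size}, {secs:.1f}, {code}, {msg}"


log = [
    status_line("COMPLETE", "http://example.com/a.log", 10240, 1.2, 200, "OK"),
    status_line("FAILED",   "http://example.com/b.log", 0,     0.3, 404, "Not Found"),
    status_line("PENDING",  "http://example.com/c.log", 0,     0.0, 0,   ""),
]

# The consumer side - equivalent to "grep ^FAILED" on the log file,
# then splitting on commas to pick out the status code.
failed = [line for line in log if line.startswith("FAILED")]
print(failed)
```

Because the first field is fixed and the separators are consistent, both a human with grep and a script with a one-line split can consume it.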

>> # stalled (no data sent/received in - pick a threshold, say 5 secs?) could
>> be helpful
> What would anyone do with that information during transfer?
Use it as an indicator to look for network issues...  # stalled should
be zero unless the network or remote host is sick.
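A stalled counter is cheap to compute if each transfer records when it last moved data; this is an illustrative sketch only (the structure and the 5-second threshold are assumptions from this thread, not curl internals):

```python
# Count transfers with no data movement for longer than a threshold.
import time

STALL_SECS = 5.0


def count_stalled(last_activity, now=None):
    """last_activity: mapping of transfer name -> timestamp (monotonic
    seconds) of the last byte sent or received on that transfer."""
    now = time.monotonic() if now is None else now
    return sum(1 for t in last_activity.values() if now - t > STALL_SECS)


# a.log and c.log have been idle for more than 5 seconds at now=106.0.
acts = {"a.log": 100.0, "b.log": 103.0, "c.log": 99.0}
print(count_stalled(acts, now=106.0))
```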
>> The issue with "speed" is that while it tells you about curl's performance,
>> what a user wants to know is what needs attention - which transfers are NOT
>> making progress, or are slow relative to the pack e.g. (where picking a
>> different mirror might help).
> If that would be an issue and you want that control and the ability to stop
> the transfers and switch source etc, it seems like an odd choice to do several
> transfers in parallel then doesn't it?
>>> o percent download (if known, which means *all* transfers need to have a
>>> known size)
>> The more files are involved, the greater the chance that at least one
>> doesn't have a known size.
> Absolutely!
>> Why not compute the statistic for all the transfers that do?  Then add an
>> asterisk
>> DL%
>> 42*
>> * Excluding 17 transfers whose size is unknown (or not yet known)
> Because it would be totally meaningless. What does 42% mean if it only is 42%
> of N and we *know* that N is not the total amount.

You are, of course, correct that the number is meaningless in an
absolute sense.  But it is a valid indicator of progress, and of what is
known.  Suppose I send someone out to inspect roads.  In a month, he
could come back and tell me that there are 6,179 potholes on 250 roads. 
And I've done nothing.  Or after the first day he could say "of the
first 10 roads I've inspected, 9 have more than 100 potholes each".  We
don't know the absolute number.  But I can certainly start to take
actions (planning, ordering materials, and repairs) - at least 27 days
sooner than waiting for absolute certainty.  It is always important to
indicate the limits of information, which is why the footnote is
important... The goal isn't perfection or mathematical correctness, it's
to be "good enough to be a hint for action".

Some information is better than none.  And by your argument, perhaps
status is not useful at all.  Consider queued files: If I queue 1,000
files, and 30 are active you CAN'T know the total size of the transfer,
since you haven't even contacted the host for 970 of them.  You can only
know the size of the 30 that are active.  And if 29 provide length
information, and 1 does not, knowing that 42% of the 29 is complete
shows progress.  Not providing progress for the 29 because of one
exception seems limiting.  And if you can't provide status in the case
of queued files (because you can't know the total size), don't bother. 

The principle here is not to let exceptions limit information - provide
as much information as possible, and let the user deal with exceptions.

You are already doing that if you provide progress statistics that
exclude queued files... Where you have the size, use % of data.  Where
you don't, use % of the number of files.
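The "42%*" idea above reduces to a few lines; this is a sketch under the assumption that each transfer exposes bytes received and (optionally) total size - the data layout is invented, not curl code:

```python
# Percent complete over the transfers whose size is known, plus a count
# of the ones excluded - the footnoted "42%*" display suggested above.
def progress(transfers):
    """transfers: list of (received_bytes, total_bytes_or_None)."""
    known = [(r, t) for r, t in transfers if t is not None]
    unknown = len(transfers) - len(known)
    total = sum(t for _, t in known)
    pct = 100 * sum(r for r, _ in known) // total if total else 0
    return pct, unknown


pct, unknown = progress([(50, 100), (34, 100), (7, None)])
print(f"DL% {pct}*  (* excluding {unknown} transfers whose size is unknown)")
```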

>> How about a curses interface?  That would give you a status window where you
>> can put detailed information on each transfer - perhaps just one line each. 
>> curses windows can randomly update & can scroll, so you can have an
>> "unlimited" list. 
> I'm not dismissing that idea, and perhaps that is the most sensible option in
> the end, but that feels like a too big of a project for me personally to
> undertake. If someone else wants to experiment with that, then please feel
> free to do so.

It's not as big as it looks, at least for basic support.  You set up the
window and, instead of printf, use the curses routines.  After each
update, you call (w)refresh.  See: initscr() - initialize; mvprintw() -
go to y, x in the window and printf.  Use the "ncurses" library.

You must have a transfer list, so a pointer to the window & line would
be all the extra information you need.

Of course you can spend infinite time polishing the look... Since all we
need is an array of scrolling regions, or perhaps distinct pages, you
should avoid overlapping windows.  Those are difficult to get right.

I'm too far underwater to help.  But a web search for "ncurses example"
will turn up many samples.  Ignore the stuff about input - especially
forms.  That's complicated, not needed, and unnecessarily scary.
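The flow described above (initscr, print at y/x, refresh) looks roughly like this; the sketch uses Python's stdlib curses binding to the same ncurses calls, and the transfer list is invented for illustration:

```python
# Minimal curses status window: one line per transfer, updated in place.
import curses
import sys


def fmt(xfer):
    """Format one status line for a (url, percent, state) tuple."""
    url, pct, state = xfer
    return f"{url:<40.40} {pct:3d}% {state}"


def draw(stdscr, transfers):
    for i, x in enumerate(transfers):
        stdscr.addstr(i, 0, fmt(x))  # mvprintw equivalent: row i, col 0
    stdscr.refresh()                 # push the updates to the terminal


if __name__ == "__main__" and sys.stdout.isatty():
    demo = [("http://example.com/a.log", 42, "active"),
            ("http://example.com/b.log", 100, "done")]
    # curses.wrapper handles initscr()/endwin() and terminal restoration.
    curses.wrapper(lambda scr: (draw(scr, demo), scr.getch()))
```

As the message says, the extra bookkeeping is tiny: each entry in the existing transfer list just needs to know which row it owns.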

For inspiration, GUI clients such as FileZilla, ws_ftp, cyberduck,
winscp (and many others) are worth a look, as are some of the torrent
clients, such as qBittorrent.  Not for initiating/managing transfers,
but for how they present queues and status.

>> If you are unwilling to require curses support, another approach is to
>> build a status page as described, and define a key to print it.  So instead
>> of continuously updating status, you get a detailed dump when you press a
>> key (say, space).
> Oh yes, that's a pretty neat idea!

Historically, it's proven to work well - we used that technique for
status of long-running programs on timesharing systems.  It was VERY
popular when we introduced it to TOPS-10 in the 70s...  But that was a
1-, at most 2-line status (process status on the first - delta time, CPU
time, program name, IO rd/wr, state; on the second, file and position if
blocked for IO).  Someday Linux will catch up...

On the other hand, if you dump 132 x 48 screens, it will get tiresome. 
For this mechanism, keep the size modest - this is why I suggested
several keys, one for each kind or portion of status.

I hope this helps.

Timothe Litt
ACM Distinguished Engineer
This communication may not represent the ACM or my employer's views,
if any, on the matters discussed.



Received on 2019-04-29