Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

urlapi: leaner with fewer allocs #9408

Closed
wants to merge 4 commits into from
Closed

Conversation

bagder
Copy link
Member

@bagder bagder commented Sep 1, 2022

urlapi: leaner with fewer allocs

Slightly faster with more robust code. Uses fewer and smaller mallocs.

  • remove two fields from the URL handle struct
  • reduce copies and allocs
  • use dynbuf buffers more instead of custom malloc + copies
  • uses dynbuf to build the host name in reduces serial alloc+free within the same function.
  • move dedotdotify into urlapi.c and make it static, and not strdup the input
  • remove a few strlen() calls
  • add Curl_dyn_setlen() that can "trim" an existing dynbuf

This is my test program: speedparse.c

@bagder bagder marked this pull request as ready for review September 1, 2022 08:21
@bagder
Copy link
Member Author

bagder commented Sep 2, 2022

Stats on 2022-09-02

Running the test program on the same machine this PR vs master,

Speed test

Running the test code with 10000 loops.

PR:     16640000 URLs in 14.56434 secs, 0.875 us/URL, 1142516.585 URLs/sec
master: 16640000 URLs in 15.29473 secs, 0.919 us/URL, 1087956.226 URLs/sec

The PR seems to be ~5% faster in this test.

Memory use

(Given the test program 10 loops to run)

Code Allocations Maximum allocated Total allocated
PR 80360 7097 2770080
master 83010 15127 4372030
new 96.8% 46.9% 63.4%

@bagder bagder changed the title urlapi: remove two fields from the URL handle struct urlapi: leaner with fewer allocs Sep 2, 2022
@bagder
Copy link
Member Author

bagder commented Sep 2, 2022

Two commits later

Running the test program on the same machine this PR vs master,

Speed test

Running the test code with 10000 loops.

PR:     16640000 URLs in 14.96147 secs, 0.899 us/URL, 1112189.956 URLs/sec
master: 16640000 URLs in 15.29473 secs, 0.919 us/URL, 1087956.226 URLs/sec

The PR completes at roughly 97.8% of the old time.

Memory use

(Given the test program 10 loops to run, 16640 URLs)

Code Allocations Maximum allocated Total allocated
PR 70330 7097 3286640
master 83010 15127 4372030
new 84.7% 46.9% 75.2%

Note:

This is a debug build on an old machine. It is likely to run much faster in other conditions.

bagder added a commit that referenced this pull request Sep 2, 2022
Slightly faster with more robust code. Uses fewer and smaller mallocs.

- remove two fields from the URL handle struct
- reduce copies and allocs
- use dynbuf buffers more instead of custom malloc + copies
- uses dynbuf to build the host name in reduces serial alloc+free within
  the same function.

Closes #9408
bagder added a commit that referenced this pull request Sep 4, 2022
Slightly faster with more robust code. Uses fewer and smaller mallocs.

- remove two fields from the URL handle struct
- reduce copies and allocs
- use dynbuf buffers more instead of custom malloc + copies
- uses dynbuf to build the host name in reduces serial alloc+free within
  the same function.
- move dedotdotify into urlapi.c and make it static, and not strdup the input

Closes #9408
@bagder
Copy link
Member Author

bagder commented Sep 4, 2022

Stats on 2022-09-04

Running the test program on the same machine this PR vs master,

Speed test

Running the test code with 10000 loops.

PR:     16640000 URLs in 14.28950 secs, 0.859 us/URL, 1164491.410 URLs/sec
master: 16640000 URLs in 15.29473 secs, 0.919 us/URL, 1087956.226 URLs/sec

The PR completes at roughly 93.4% of the old time.

Memory use

(Given the test program 10 loops to run)

Code Allocations Maximum allocated Total allocated
PR 65700 6095 3184100
master 83010 15127 4372030
new 79.1% 40.3% 72.8%

@bagder
Copy link
Member Author

bagder commented Sep 5, 2022

The same PR and commit version in a non-debug build times like this:

16640000 URLs in 9.11436 secs, 0.548 us/URL, 1825689.447 URLs/sec

@bagder
Copy link
Member Author

bagder commented Sep 5, 2022

I ran valgrind --tool=callgrind ./debugit 100 to check which functions that use the most time and here are some lessons. Some of the > 2% spender functions:

106,299,100 ( 9.90%)  strcspn()
 66,310,404 ( 6.17%)  Curl_is_absolute_url
 33,243,200 ( 3.10%)  Curl_strcasecompare
 31,608,000 ( 2.94%)  Curl_isalnum
 29,712,704 ( 2.77%)  dedotdotify
 29,094,313 ( 2.71%)  strchr()
 28,529,000 ( 2.66%)  sscanf()
 27,335,266 ( 2.55%)  strlen()
 25,544,200 ( 2.38%)  Curl_raw_toupper

bagder added a commit that referenced this pull request Sep 5, 2022
Slightly faster with more robust code. Uses fewer and smaller mallocs.

- remove two fields from the URL handle struct
- reduce copies and allocs
- use dynbuf buffers more instead of custom malloc + copies
- uses dynbuf to build the host name in reduces serial alloc+free within
  the same function.
- move dedotdotify into urlapi.c and make it static, and not strdup the input

Closes #9408
@bagder
Copy link
Member Author

bagder commented Sep 5, 2022

Stats on 2022-09-05

Running the test program on the same machine this PR vs master,

Speed test

Running the test code with 10000 loops.

PR:     16640000 URLs in 13.77847 secs, 0.828 us/URL, 1207681.172 URLs/sec
master: 16640000 URLs in 15.29473 secs, 0.919 us/URL, 1087956.226 URLs/sec

The PR completes at roughly 90.1% of the old time.

Memory use

(Given the test program 10 loops to run)

Code Allocations Maximum allocated Total allocated
PR 65700 6096 3184100
master 83010 15127 4372030
new 79.1% 40.3% 72.8%

@bagder
Copy link
Member Author

bagder commented Sep 5, 2022

The ctype change is not a urlapi change at all, so I will split that out as a separate PR. Some of the performance gain in the last update may be attributed to that ctype PR.

bagder added a commit that referenced this pull request Sep 5, 2022
Slightly faster with more robust code. Uses fewer and smaller mallocs.

- remove two fields from the URL handle struct
- reduce copies and allocs
- use dynbuf buffers more instead of custom malloc + copies
- uses dynbuf to build the host name in reduces serial alloc+free within
  the same function.
- move dedotdotify into urlapi.c and make it static, and not strdup the input

Closes #9408
bagder added a commit that referenced this pull request Sep 6, 2022
Slightly faster with more robust code. Uses fewer and smaller mallocs.

- remove two fields from the URL handle struct
- reduce copies and allocs
- use dynbuf buffers more instead of custom malloc + copies
- uses dynbuf to build the host name in reduces serial alloc+free within
  the same function.
- move dedotdotify into urlapi.c and make it static, and not strdup the input
- remove a few strlen() calls
- add Curl_dyn_setlen() that can "trim" an existing dynbuf

Closes #9408
@bagder
Copy link
Member Author

bagder commented Sep 6, 2022

An optimized non-debug build on the same host and we are below half a microsecond on average:

16640000 URLs in 7.93909 secs, 0.477 us/URL, 2095959.421 URLs/sec

Slightly faster with more robust code. Uses fewer and smaller mallocs.

- remove two fields from the URL handle struct
- reduce copies and allocs
- use dynbuf buffers more instead of custom malloc + copies
- uses dynbuf to build the host name in reduces serial alloc+free within
  the same function.
- move dedotdotify into urlapi.c and make it static, not strdup the input
  and optimize it by checking for . and / before using strncmp
- remove a few strlen() calls
- add Curl_dyn_setlen() that can "trim" an existing dynbuf

Closes #9408
@bagder
Copy link
Member Author

bagder commented Sep 6, 2022

On my mac mini, the current PR improves things from

16640000 URLs in 15.17836 secs, 0.912 us/URL, 1096297.696 URLs/sec

to

16640000 URLs in 13.25299 secs, 0.796 us/URL, 1255566.104 URLs/sec

@bagder bagder closed this in f703cf9 Sep 7, 2022
@bagder bagder deleted the bagder/urlapi-reduce-struct branch September 7, 2022 08:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

None yet

1 participant