Binary search for students
I'm putting together an example for students in which I would like to show how we can optimize even such a basic algorithm as binary search with respect to the hardware. I do not consider threads. I wrote it in C++ and the whole example can be found here. I'm trying to show some examples of prefetching.
Test data
If we want to see the effects of the optimizations proposed below, it is necessary to use data that are not in the CPU cache.
We are going to test it using some code like this:
// arr - array with at least 100MB of values
// test_arr - array with values that we are going to search in arr
// test_count - size of test_arr
for (auto i = 0u; i < test_count; ++i) {
    auto position = binarySearch(item_count, arr, test_arr[i]);
}
Size - If arr is significantly larger than the largest CPU cache, then most of the array won't be stored there and we will inevitably experience L2 cache misses during the run. L2 cache misses are slow, and they become the major bottleneck of our binary search algorithm.
Randomness - It is better to have values in test_arr that access arr randomly; otherwise, the basic binary search algorithm will benefit from data access locality.
Please note that L2 cache misses can occur even on a small array in some complex algorithm that works with a lot of data. In other words, we need a large array only for testing purposes, but the optimization can be useful wherever we expect the binary search to happen on data outside the CPU cache.
The basic algorithm
It is something along these lines:
using Type = int64_t;
int binarySearch_basic(size_t item_count, const Type* arr, int search)
{
    int l = 0;
    int r = item_count - 1;
    while (l <= r) {
        int m = l + ((r - l) >> 1);
        if (arr[m] == search) return m;
        if (arr[m] < search) l = m + 1;
        else r = m - 1;
    }
    return -1;
}
Prefetch
The first optimization simply tries to touch the possibly needed memory in advance:
int binarySearch_prefetch(size_t item_count, const Type* arr, int search)
{
    int l = 0;
    int r = item_count - 1;
    while (l <= r) {
        int m1 = arr[l + ((r - l) >> 2)]; // new code: next midpoint if we go left
        int m2 = arr[r - ((r - l) >> 2)]; // new code: next midpoint if we go right
        int m = l + ((r - l) >> 1);
        if (arr[m] == search) return m;
        if (arr[m] < search) l = m + 1;
        else r = m - 1;
    }
    return -1;
}
More items per iteration
The second optimization is a little more sophisticated. It accesses more than one item per iteration, which allows reading more relevant data per L2 cache-miss wait:
int binarySearch_duo(size_t item_count, const Type* arr, int search)
{
    int l = 0;
    int r = item_count - 1;
    while (r - l > 3) {
        int delta = (r - l) / 3;
        int m1 = l + delta;
        int m2 = r - delta;
        if (arr[m1] == search) return m1;
        if (arr[m2] == search) return m2;
        if (arr[m1] < search) {
            if (arr[m2] < search) {
                l = m2 + 1;
            } else {
                r = m2 - 1;
                l = m1 + 1;
            }
        } else {
            r = m1 - 1;
        }
    }
    for (int i = l; i <= r; i++) {
        if (arr[i] == search) return i;
    }
    return -1;
}
Of course, this approach can be extended to more accesses per iteration (I'm testing three as well in my demo). There are some infinite-loop issues if the (l, r) interval is very small (fewer than three items); therefore, I finish the remaining items with a simple sequential search to keep the main iteration clean and simple.
I was doing my tests with the Visual Studio 2017 C++ compiler.
I have several questions:
- Are there any other significant optimizations that could help the algorithm?
- Is my explanation of the improvement in the last algorithm correct? (Is it really because the CPU reads the two data items in parallel, so it waits for memory just once instead of twice, as in the first algorithm?)
- Are there any other comments about my code?
c++ performance binary-search
Comment:
– Mast (14 mins ago): Please do not update the code in your question to incorporate feedback from answers; doing so goes against the Question + Answer style of Code Review. Please see what you may and may not do after receiving answers. There is no immediate problem, but if you add more code we'll have to roll back the edit. Should your code be improved over time, post a follow-up question linking back to this one instead. Give it some time, though.
asked 4 hours ago by Radim Bača (new contributor); edited 13 mins ago
1 Answer
How did you measure the performance of your code (test code, hardware, compiler, etc.)? Without that, it is impossible to get into optimization. Depending on the chosen compiler I was sort of able to reproduce your results; see this quick bench: for clang 7.0.0 the fancy algorithm is indeed faster. That said, with gcc the basic one is a lot faster (outperforming the fancy algorithm under both clang and gcc).
With that in mind, you can optimize your algorithm for different array sizes quite easily (benchmarking on your specific setup) by choosing the fastest of the three for specific sizes.
As far as the code is concerned, there are a couple of things you can improve.
You could rewrite your algorithms so that they accept iterators, just like the algorithms from the standard library. That will drastically improve their flexibility without added effort (and most probably without degrading performance, but you should measure it anyway).
Apart from that, you should avoid implicit casts, e.g. here, where a size_t is silently narrowed to an int:
int r = item_count - 1;
You should also prefer to keep the bodies of your functions short; it's hard to understand what binarySearch_duo actually does (you could easily extract the body of the loop, for instance). Also, avoid magic constants: where does the seven in while (r - l > 7) come from?
Edit: I've created one more benchmark for this, this time as proposed by the OP, that is, with a huge array (actually a vector) and random access to its elements; see here. This time, with gcc, the fancy version is as good as the basic one. The lesson should probably be: the compiler is usually at least as smart as you are :).
answered 2 hours ago by paler123 (new contributor); edited 23 mins ago
Comments:
– Radim Bača (1 hour ago): You are using a small array that fits into the L2 cache (the CPU type is unknown). You can try to extend the size of the array to 1M to see the fancy code win. However, it is still quite a small array, and L2 cache misses do not occur significantly often. The number 7 is not important for the performance; it can be any small number larger than 3. I selected 7 since it approximately corresponds to one 64-byte cache line.
– paler123 (1 hour ago): Not necessarily; gcc, even after increasing the count of elements in the array to 10^6, produces better times for the basic implementation. What I'm trying to say is that your results will vary depending on your benchmark, and if you don't provide a reliable way to measure performance, it's tough to recommend further optimisations.
– Radim Bača (1 hour ago): You are right. It is true that I did not specify the benchmark clearly here. I'll update the question.
– Radim Bača (49 mins ago): I have updated the question. I hope it is clearer now. Thank you for your recommendations.
– Radim Bača (42 mins ago): Concerning your benchmark - it is not large enough, and the queries into the array are not random.
Your Answer
StackExchange.ifUsing("editor", function () {
return StackExchange.using("mathjaxEditing", function () {
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
});
});
}, "mathjax-editing");
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "196"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Radim Bača is a new contributor. Be nice, and check out our Code of Conduct.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f212088%2fbinary-search-for-students%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
$begingroup$
How did you measure the performance of your code (test code, hardware, compiler, etc.)? Without that it is impossible to get into optimization. Depending on chosen compiler I was sort of able to reproduce your results, see this quick bench- for clang 7.0.0 the fancy algorithm is indeed faster. That being said with gcc the basic one is a lot faster (outperforming fancy algorithm in both clang and gcc).
With that in mind, you can optimize your algorithm for different array sizes quite easily (benchmarking on your specific setup), by choosing fastest of the 3 for specific sizes.
As far as the code is concerned, there are a couple of things you can improve.
You could rewrite your algorithms so that they accept iterators, just like all the algorithms from standard library, that will drastically improve their flexibility without added effort (and most probably without degrading performance, but you should measure it anyway).
Apart from that, you should avoid implicit casts, e.g. here:
int r = item_count - 1;
You should also prefer to keep the bodies of your functions short, it's hard to understand what binarySearch_duo
actually does (you could easily extract body of the loop etc). Also, avoid magic constants, where does the seven in: while (r - l > 7) {
come from?
Edit: I've creted one more benchmark for this, this time as proposed by OP, that is with huge array size (actually a vector) and made the access to the elements of the array random, see here- this time with gcc fancy version is as good as basic. The lesson should probably be: compiler is usually at least as smart as you are :).
New contributor
$endgroup$
$begingroup$
You are using a small array that fits into the L2 cache (the CPU type is unknown). You can try to extend the size of the array to 1M to see the fancy code to win. However, it is still quite a small array and L2 cache misses do not occur significantly often. The number 7 is not important for the performance, it can be any small number lager than 3. I selected 7 since it approximately corresponds to one 64 cache line.
$endgroup$
– Radim Bača
1 hour ago
1
$begingroup$
Not necessarily, gcc even after increasing the count of elements in the array to10^6
elements produces better times for basic implementation. What I'm trying to say is that your results will vary depending on your benchmark and if you won't provide reliable way to measure performance, then it's tough to recommend further optimisations.
$endgroup$
– paler123
1 hour ago
$begingroup$
You are right. It is true that I did not specify the benchmark clearly here. I'll update the question.
$endgroup$
– Radim Bača
1 hour ago
$begingroup$
I have updated the question. I hope it is more clear now. Thank you for your recommendations.
$endgroup$
– Radim Bača
49 mins ago
$begingroup$
Concerning your benchmark - it is not large enough and also the queries into the array are not random.
$endgroup$
– Radim Bača
42 mins ago
|
show 2 more comments
$begingroup$
How did you measure the performance of your code (test code, hardware, compiler, etc.)? Without that it is impossible to get into optimization. Depending on chosen compiler I was sort of able to reproduce your results, see this quick bench- for clang 7.0.0 the fancy algorithm is indeed faster. That being said with gcc the basic one is a lot faster (outperforming fancy algorithm in both clang and gcc).
With that in mind, you can optimize your algorithm for different array sizes quite easily (benchmarking on your specific setup), by choosing fastest of the 3 for specific sizes.
As far as the code is concerned, there are a couple of things you can improve.
You could rewrite your algorithms so that they accept iterators, just like all the algorithms from standard library, that will drastically improve their flexibility without added effort (and most probably without degrading performance, but you should measure it anyway).
Apart from that, you should avoid implicit casts, e.g. here:
int r = item_count - 1;
You should also prefer to keep the bodies of your functions short, it's hard to understand what binarySearch_duo
actually does (you could easily extract body of the loop etc). Also, avoid magic constants, where does the seven in: while (r - l > 7) {
come from?
Edit: I've creted one more benchmark for this, this time as proposed by OP, that is with huge array size (actually a vector) and made the access to the elements of the array random, see here- this time with gcc fancy version is as good as basic. The lesson should probably be: compiler is usually at least as smart as you are :).
New contributor
$endgroup$
$begingroup$
You are using a small array that fits into the L2 cache (the CPU type is unknown). You can try to extend the size of the array to 1M to see the fancy code to win. However, it is still quite a small array and L2 cache misses do not occur significantly often. The number 7 is not important for the performance, it can be any small number lager than 3. I selected 7 since it approximately corresponds to one 64 cache line.
$endgroup$
– Radim Bača
1 hour ago
1
$begingroup$
Not necessarily, gcc even after increasing the count of elements in the array to10^6
elements produces better times for basic implementation. What I'm trying to say is that your results will vary depending on your benchmark and if you won't provide reliable way to measure performance, then it's tough to recommend further optimisations.
$endgroup$
– paler123
1 hour ago
$begingroup$
You are right. It is true that I did not specify the benchmark clearly here. I'll update the question.
$endgroup$
– Radim Bača
1 hour ago
$begingroup$
I have updated the question. I hope it is more clear now. Thank you for your recommendations.
$endgroup$
– Radim Bača
49 mins ago
$begingroup$
Concerning your benchmark - it is not large enough and also the queries into the array are not random.
$endgroup$
– Radim Bača
42 mins ago
|
show 2 more comments
$begingroup$
How did you measure the performance of your code (test code, hardware, compiler, etc.)? Without that it is impossible to get into optimization. Depending on chosen compiler I was sort of able to reproduce your results, see this quick bench- for clang 7.0.0 the fancy algorithm is indeed faster. That being said with gcc the basic one is a lot faster (outperforming fancy algorithm in both clang and gcc).
With that in mind, you can optimize your algorithm for different array sizes quite easily (benchmarking on your specific setup), by choosing fastest of the 3 for specific sizes.
As far as the code is concerned, there are a couple of things you can improve.
You could rewrite your algorithms so that they accept iterators, just like all the algorithms from standard library, that will drastically improve their flexibility without added effort (and most probably without degrading performance, but you should measure it anyway).
Apart from that, you should avoid implicit casts, e.g. here:
int r = item_count - 1;
You should also prefer to keep the bodies of your functions short, it's hard to understand what binarySearch_duo
actually does (you could easily extract body of the loop etc). Also, avoid magic constants, where does the seven in: while (r - l > 7) {
come from?
Edit: I've creted one more benchmark for this, this time as proposed by OP, that is with huge array size (actually a vector) and made the access to the elements of the array random, see here- this time with gcc fancy version is as good as basic. The lesson should probably be: compiler is usually at least as smart as you are :).
New contributor
$endgroup$
How did you measure the performance of your code (test code, hardware, compiler, etc.)? Without that it is impossible to get into optimization. Depending on chosen compiler I was sort of able to reproduce your results, see this quick bench- for clang 7.0.0 the fancy algorithm is indeed faster. That being said with gcc the basic one is a lot faster (outperforming fancy algorithm in both clang and gcc).
With that in mind, you can optimize your algorithm for different array sizes quite easily (benchmarking on your specific setup), by choosing fastest of the 3 for specific sizes.
As far as the code is concerned, there are a couple of things you can improve.
You could rewrite your algorithms so that they accept iterators, just like all the algorithms from standard library, that will drastically improve their flexibility without added effort (and most probably without degrading performance, but you should measure it anyway).
Apart from that, you should avoid implicit casts, e.g. here:
int r = item_count - 1;
You should also prefer to keep the bodies of your functions short, it's hard to understand what binarySearch_duo
actually does (you could easily extract body of the loop etc). Also, avoid magic constants, where does the seven in: while (r - l > 7) {
come from?
Edit: I've creted one more benchmark for this, this time as proposed by OP, that is with huge array size (actually a vector) and made the access to the elements of the array random, see here- this time with gcc fancy version is as good as basic. The lesson should probably be: compiler is usually at least as smart as you are :).
edited 23 mins ago
answered 2 hours ago
paler123
New contributor
$begingroup$
You are using a small array that fits into the L2 cache (the CPU type is unknown). You can try to extend the size of the array to 1M to see the fancy code win. However, it is still quite a small array and L2 cache misses do not occur significantly often. The number 7 is not important for the performance; it can be any small number larger than 3. I selected 7 since it approximately corresponds to one 64-byte cache line.
$endgroup$
– Radim Bača
1 hour ago
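The arithmetic behind that comment can be spelled out with named constants (the constant names are mine, not from the original code): a 64-byte line holds 64 / sizeof(int64_t) = 8 values, so a remaining range of 7 or fewer elements spans roughly one cache line.

```cpp
#include <cstddef>
#include <cstdint>

using Type = std::int64_t;

// One 64-byte cache line holds 64 / 8 = 8 values of int64_t. Once the
// remaining search range fits in about one line, further halving no longer
// saves memory traffic, so 8 - 1 = 7 (the magic `7` in `while (r - l > 7)`)
// is a natural cut-off for switching to a linear scan.
constexpr std::size_t kCacheLineBytes  = 64;
constexpr std::size_t kValuesPerLine   = kCacheLineBytes / sizeof(Type); // 8
constexpr std::size_t kLinearThreshold = kValuesPerLine - 1;             // 7

static_assert(kLinearThreshold == 7, "matches the original magic constant");
```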
$begingroup$
Not necessarily: gcc, even after increasing the count of elements in the array to 10^6, produces better times for the basic implementation. What I'm trying to say is that your results will vary depending on your benchmark, and if you don't provide a reliable way to measure performance, it's tough to recommend further optimisations.
$endgroup$
– paler123
1 hour ago
$begingroup$
You are right. It is true that I did not specify the benchmark clearly here. I'll update the question.
$endgroup$
– Radim Bača
1 hour ago
$begingroup$
I have updated the question. I hope it is clearer now. Thank you for your recommendations.
$endgroup$
– Radim Bača
49 mins ago
$begingroup$
Concerning your benchmark: it is not large enough, and the queries into the array are not random.
$endgroup$
– Radim Bača
42 mins ago
Radim Bača is a new contributor. Be nice, and check out our Code of Conduct.
Thanks for contributing an answer to Code Review Stack Exchange!
$begingroup$
Please do not update the code in your question to incorporate feedback from answers; doing so goes against the Question + Answer style of Code Review. Please see what you may and may not do after receiving answers. There is no immediate problem, but if you add more code we'll have to roll back the edit. Should your code be improved over time, post a follow-up question linking back to this one instead. Give it some time, though.
$endgroup$
– Mast
14 mins ago