Binary search for students












4












$begingroup$


I'm putting together an example for students where I would like to show how we can optimize even such basic algorithm like a binary search with respect to hardware. I do not consider threads. I wrote it in C++ and the whole example can be found here. I'm trying to show some examples of the prefetching.



Test data If we would like to see the effects of the optimizations proposed in the following examples, then it is necessary to use data that are not in the CPU cache.



We are going to test it using some code like this



// arr - array with at least 100MB of values
// test_arr - array with values that we are going to search in arr
// test_count - size of the test_arr
for (auto i = 0u; i < test_count; ++i) {
auto position = binarySearch(item_count, arr, test_arr[i]);
}


Size - If the arr array is significantly larger then largest CPU cache then most of the array won't be stored there and we will inevitably experience the L2 cache misses during the run. L2 cache misses are slow and it will become the major bottleneck of our binary search algorithm.



Randomness - It is better to have data in the test_arr that accesses the arr randomly, otherwise, the basic binary search algorithm will benefit from data access locality.



Please note that the L2 cache misses can be experienced even on a small array in some complex algorithm which works with a lot of data. In other words, we need a large array only for the purpose of testing, but the optimization can be useful everywhere where expect the binary search to happen on a data outside of CPU cache.



The basic algorithm It is something along these lines:



using Type = int64_t;
int binarySearch_basic(size_t item_count, const Type arr, int search)
{
int l = 0;
int r = item_count - 1;
while (l <= r) {
int m = l + ((r - l) >> 1);
if (arr[m] == search) return m;
if (arr[m] < search) l = m + 1;
else r = m - 1;
}
return -1;
}


Prefetch The first optimization simply try to prefetch possible memory in advance



int binarySearch_basic(size_t item_count, const Type arr, int search)
{
int l = 0;
int r = item_count - 1;
while (l <= r) {
int m1 = arr[l + ((r - l) >> 2)]; // new code
int m2 = arr[r - ((r - l) >> 2)]; // new code

int m = l + ((r - l) >> 1);
if (arr[m] == search) return m;
if (arr[m] < search) l = m + 1;
else r = m - 1;
}
return -1;
}


More items per iteration The second optimization is a little bit more sophisticated. It accesses more than one item per iteration which allows read more relevant data per CPU L2 cache miss wait.



int binarySearch_duo(size_t item_count, const Type arr, int search)
{
int l = 0;
int r = item_count - 1;
while (r - l > 3) {

int delta = (r - l) / 3;
int m1 = l + delta;
int m2 = r - delta;

if (arr[m1] == search) return m1;
if (arr[m2] == search) return m2;

if (arr[m1] < search) {
if (arr[m2] < search) {
l = m2 + 1;
}
else {
r = m2 - 1;
l = m1 + 1;
}
}
else {
r = m1 - 1;
}
}

for (int i = l; i <= r; i++)
{
if (arr[i] == search) return i;
}

return -1;
}


Of course, this approach can be done with more accesses per iteration (I'm testing three as well in my demo). There are some infinite loop issues if the (l,r) interval is very small (less then three), therefore, I finish the remaining items by a simple sequential search to keep the main iteration clean and simple.



I was doing my tests with Visual Studio 2017 C++ compiler.



I have several questions:




  • Are there any other possible significant optimizations that could help the algorithm?

  • Is my explanation of the improvement of the last algorithm correct? (is it really due to the fact that the CPU read in parallel two data items; therefore, he waits for it just once not twice like in the first algorithm).

  • Are there any other comments about my code?










share|improve this question









New contributor




Radim Bača is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.







$endgroup$












  • $begingroup$
    Please do not update the code in your question to incorporate feedback from answers, doing so goes against the Question + Answer style of Code Review. Please see what you may and may not do after receiving answers. There is no immediate problem, but if you add more code we'll have to rollback the edit. Should your code be improved over time, post a follow-up question linking back to this one instead. Give it some time though.
    $endgroup$
    – Mast
    14 mins ago
















4












$begingroup$


I'm putting together an example for students where I would like to show how we can optimize even such basic algorithm like a binary search with respect to hardware. I do not consider threads. I wrote it in C++ and the whole example can be found here. I'm trying to show some examples of the prefetching.



Test data If we would like to see the effects of the optimizations proposed in the following examples, then it is necessary to use data that are not in the CPU cache.



We are going to test it using some code like this



// arr - array with at least 100MB of values
// test_arr - array with values that we are going to search in arr
// test_count - size of the test_arr
for (auto i = 0u; i < test_count; ++i) {
auto position = binarySearch(item_count, arr, test_arr[i]);
}


Size - If the arr array is significantly larger then largest CPU cache then most of the array won't be stored there and we will inevitably experience the L2 cache misses during the run. L2 cache misses are slow and it will become the major bottleneck of our binary search algorithm.



Randomness - It is better to have data in the test_arr that accesses the arr randomly, otherwise, the basic binary search algorithm will benefit from data access locality.



Please note that the L2 cache misses can be experienced even on a small array in some complex algorithm which works with a lot of data. In other words, we need a large array only for the purpose of testing, but the optimization can be useful everywhere where expect the binary search to happen on a data outside of CPU cache.



The basic algorithm It is something along these lines:



using Type = int64_t;
int binarySearch_basic(size_t item_count, const Type arr, int search)
{
int l = 0;
int r = item_count - 1;
while (l <= r) {
int m = l + ((r - l) >> 1);
if (arr[m] == search) return m;
if (arr[m] < search) l = m + 1;
else r = m - 1;
}
return -1;
}


Prefetch The first optimization simply try to prefetch possible memory in advance



int binarySearch_basic(size_t item_count, const Type arr, int search)
{
int l = 0;
int r = item_count - 1;
while (l <= r) {
int m1 = arr[l + ((r - l) >> 2)]; // new code
int m2 = arr[r - ((r - l) >> 2)]; // new code

int m = l + ((r - l) >> 1);
if (arr[m] == search) return m;
if (arr[m] < search) l = m + 1;
else r = m - 1;
}
return -1;
}


More items per iteration The second optimization is a little bit more sophisticated. It accesses more than one item per iteration which allows read more relevant data per CPU L2 cache miss wait.



int binarySearch_duo(size_t item_count, const Type arr, int search)
{
int l = 0;
int r = item_count - 1;
while (r - l > 3) {

int delta = (r - l) / 3;
int m1 = l + delta;
int m2 = r - delta;

if (arr[m1] == search) return m1;
if (arr[m2] == search) return m2;

if (arr[m1] < search) {
if (arr[m2] < search) {
l = m2 + 1;
}
else {
r = m2 - 1;
l = m1 + 1;
}
}
else {
r = m1 - 1;
}
}

for (int i = l; i <= r; i++)
{
if (arr[i] == search) return i;
}

return -1;
}


Of course, this approach can be done with more accesses per iteration (I'm testing three as well in my demo). There are some infinite loop issues if the (l,r) interval is very small (less then three), therefore, I finish the remaining items by a simple sequential search to keep the main iteration clean and simple.



I was doing my tests with Visual Studio 2017 C++ compiler.



I have several questions:




  • Are there any other possible significant optimizations that could help the algorithm?

  • Is my explanation of the improvement of the last algorithm correct? (is it really due to the fact that the CPU read in parallel two data items; therefore, he waits for it just once not twice like in the first algorithm).

  • Are there any other comments about my code?










share|improve this question









New contributor




Radim Bača is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.







$endgroup$












  • $begingroup$
    Please do not update the code in your question to incorporate feedback from answers, doing so goes against the Question + Answer style of Code Review. Please see what you may and may not do after receiving answers. There is no immediate problem, but if you add more code we'll have to rollback the edit. Should your code be improved over time, post a follow-up question linking back to this one instead. Give it some time though.
    $endgroup$
    – Mast
    14 mins ago














4












4








4





$begingroup$


I'm putting together an example for students where I would like to show how we can optimize even such basic algorithm like a binary search with respect to hardware. I do not consider threads. I wrote it in C++ and the whole example can be found here. I'm trying to show some examples of the prefetching.



Test data If we would like to see the effects of the optimizations proposed in the following examples, then it is necessary to use data that are not in the CPU cache.



We are going to test it using some code like this



// arr - array with at least 100MB of values
// test_arr - array with values that we are going to search in arr
// test_count - size of the test_arr
for (auto i = 0u; i < test_count; ++i) {
auto position = binarySearch(item_count, arr, test_arr[i]);
}


Size - If the arr array is significantly larger then largest CPU cache then most of the array won't be stored there and we will inevitably experience the L2 cache misses during the run. L2 cache misses are slow and it will become the major bottleneck of our binary search algorithm.



Randomness - It is better to have data in the test_arr that accesses the arr randomly, otherwise, the basic binary search algorithm will benefit from data access locality.



Please note that the L2 cache misses can be experienced even on a small array in some complex algorithm which works with a lot of data. In other words, we need a large array only for the purpose of testing, but the optimization can be useful everywhere where expect the binary search to happen on a data outside of CPU cache.



The basic algorithm It is something along these lines:



using Type = int64_t;
int binarySearch_basic(size_t item_count, const Type arr, int search)
{
int l = 0;
int r = item_count - 1;
while (l <= r) {
int m = l + ((r - l) >> 1);
if (arr[m] == search) return m;
if (arr[m] < search) l = m + 1;
else r = m - 1;
}
return -1;
}


Prefetch The first optimization simply try to prefetch possible memory in advance



int binarySearch_basic(size_t item_count, const Type arr, int search)
{
int l = 0;
int r = item_count - 1;
while (l <= r) {
int m1 = arr[l + ((r - l) >> 2)]; // new code
int m2 = arr[r - ((r - l) >> 2)]; // new code

int m = l + ((r - l) >> 1);
if (arr[m] == search) return m;
if (arr[m] < search) l = m + 1;
else r = m - 1;
}
return -1;
}


More items per iteration The second optimization is a little bit more sophisticated. It accesses more than one item per iteration which allows read more relevant data per CPU L2 cache miss wait.



int binarySearch_duo(size_t item_count, const Type arr, int search)
{
int l = 0;
int r = item_count - 1;
while (r - l > 3) {

int delta = (r - l) / 3;
int m1 = l + delta;
int m2 = r - delta;

if (arr[m1] == search) return m1;
if (arr[m2] == search) return m2;

if (arr[m1] < search) {
if (arr[m2] < search) {
l = m2 + 1;
}
else {
r = m2 - 1;
l = m1 + 1;
}
}
else {
r = m1 - 1;
}
}

for (int i = l; i <= r; i++)
{
if (arr[i] == search) return i;
}

return -1;
}


Of course, this approach can be done with more accesses per iteration (I'm testing three as well in my demo). There are some infinite loop issues if the (l,r) interval is very small (less then three), therefore, I finish the remaining items by a simple sequential search to keep the main iteration clean and simple.



I was doing my tests with Visual Studio 2017 C++ compiler.



I have several questions:




  • Are there any other possible significant optimizations that could help the algorithm?

  • Is my explanation of the improvement of the last algorithm correct? (is it really due to the fact that the CPU read in parallel two data items; therefore, he waits for it just once not twice like in the first algorithm).

  • Are there any other comments about my code?










share|improve this question









New contributor




Radim Bača is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.







$endgroup$




I'm putting together an example for students where I would like to show how we can optimize even such basic algorithm like a binary search with respect to hardware. I do not consider threads. I wrote it in C++ and the whole example can be found here. I'm trying to show some examples of the prefetching.



Test data If we would like to see the effects of the optimizations proposed in the following examples, then it is necessary to use data that are not in the CPU cache.



We are going to test it using some code like this



// arr - array with at least 100MB of values
// test_arr - array with values that we are going to search in arr
// test_count - size of the test_arr
for (auto i = 0u; i < test_count; ++i) {
auto position = binarySearch(item_count, arr, test_arr[i]);
}


Size - If the arr array is significantly larger then largest CPU cache then most of the array won't be stored there and we will inevitably experience the L2 cache misses during the run. L2 cache misses are slow and it will become the major bottleneck of our binary search algorithm.



Randomness - It is better to have data in the test_arr that accesses the arr randomly, otherwise, the basic binary search algorithm will benefit from data access locality.



Please note that the L2 cache misses can be experienced even on a small array in some complex algorithm which works with a lot of data. In other words, we need a large array only for the purpose of testing, but the optimization can be useful everywhere where expect the binary search to happen on a data outside of CPU cache.



The basic algorithm It is something along these lines:



using Type = int64_t;
int binarySearch_basic(size_t item_count, const Type arr, int search)
{
int l = 0;
int r = item_count - 1;
while (l <= r) {
int m = l + ((r - l) >> 1);
if (arr[m] == search) return m;
if (arr[m] < search) l = m + 1;
else r = m - 1;
}
return -1;
}


Prefetch The first optimization simply try to prefetch possible memory in advance



int binarySearch_basic(size_t item_count, const Type arr, int search)
{
int l = 0;
int r = item_count - 1;
while (l <= r) {
int m1 = arr[l + ((r - l) >> 2)]; // new code
int m2 = arr[r - ((r - l) >> 2)]; // new code

int m = l + ((r - l) >> 1);
if (arr[m] == search) return m;
if (arr[m] < search) l = m + 1;
else r = m - 1;
}
return -1;
}


More items per iteration The second optimization is a little bit more sophisticated. It accesses more than one item per iteration which allows read more relevant data per CPU L2 cache miss wait.



int binarySearch_duo(size_t item_count, const Type arr, int search)
{
int l = 0;
int r = item_count - 1;
while (r - l > 3) {

int delta = (r - l) / 3;
int m1 = l + delta;
int m2 = r - delta;

if (arr[m1] == search) return m1;
if (arr[m2] == search) return m2;

if (arr[m1] < search) {
if (arr[m2] < search) {
l = m2 + 1;
}
else {
r = m2 - 1;
l = m1 + 1;
}
}
else {
r = m1 - 1;
}
}

for (int i = l; i <= r; i++)
{
if (arr[i] == search) return i;
}

return -1;
}


Of course, this approach can be done with more accesses per iteration (I'm testing three as well in my demo). There are some infinite loop issues if the (l,r) interval is very small (less then three), therefore, I finish the remaining items by a simple sequential search to keep the main iteration clean and simple.



I was doing my tests with Visual Studio 2017 C++ compiler.



I have several questions:




  • Are there any other possible significant optimizations that could help the algorithm?

  • Is my explanation of the improvement of the last algorithm correct? (is it really due to the fact that the CPU read in parallel two data items; therefore, he waits for it just once not twice like in the first algorithm).

  • Are there any other comments about my code?







c++ performance binary-search






share|improve this question









New contributor




Radim Bača is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.











share|improve this question









New contributor




Radim Bača is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.









share|improve this question




share|improve this question








edited 13 mins ago







Radim Bača













New contributor




Radim Bača is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.









asked 4 hours ago









Radim BačaRadim Bača

1214




1214




New contributor




Radim Bača is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.





New contributor





Radim Bača is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.






Radim Bača is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.












  • $begingroup$
    Please do not update the code in your question to incorporate feedback from answers, doing so goes against the Question + Answer style of Code Review. Please see what you may and may not do after receiving answers. There is no immediate problem, but if you add more code we'll have to rollback the edit. Should your code be improved over time, post a follow-up question linking back to this one instead. Give it some time though.
    $endgroup$
    – Mast
    14 mins ago


















  • $begingroup$
    Please do not update the code in your question to incorporate feedback from answers, doing so goes against the Question + Answer style of Code Review. Please see what you may and may not do after receiving answers. There is no immediate problem, but if you add more code we'll have to rollback the edit. Should your code be improved over time, post a follow-up question linking back to this one instead. Give it some time though.
    $endgroup$
    – Mast
    14 mins ago
















$begingroup$
Please do not update the code in your question to incorporate feedback from answers, doing so goes against the Question + Answer style of Code Review. Please see what you may and may not do after receiving answers. There is no immediate problem, but if you add more code we'll have to rollback the edit. Should your code be improved over time, post a follow-up question linking back to this one instead. Give it some time though.
$endgroup$
– Mast
14 mins ago




$begingroup$
Please do not update the code in your question to incorporate feedback from answers, doing so goes against the Question + Answer style of Code Review. Please see what you may and may not do after receiving answers. There is no immediate problem, but if you add more code we'll have to rollback the edit. Should your code be improved over time, post a follow-up question linking back to this one instead. Give it some time though.
$endgroup$
– Mast
14 mins ago










1 Answer
1






active

oldest

votes


















4












$begingroup$

How did you measure the performance of your code (test code, hardware, compiler, etc.)? Without that it is impossible to get into optimization. Depending on chosen compiler I was sort of able to reproduce your results, see this quick bench- for clang 7.0.0 the fancy algorithm is indeed faster. That being said with gcc the basic one is a lot faster (outperforming fancy algorithm in both clang and gcc).

With that in mind, you can optimize your algorithm for different array sizes quite easily (benchmarking on your specific setup), by choosing fastest of the 3 for specific sizes.



As far as the code is concerned, there are a couple of things you can improve.



You could rewrite your algorithms so that they accept iterators, just like all the algorithms from standard library, that will drastically improve their flexibility without added effort (and most probably without degrading performance, but you should measure it anyway).
Apart from that, you should avoid implicit casts, e.g. here:



int r = item_count - 1;


You should also prefer to keep the bodies of your functions short, it's hard to understand what binarySearch_duo actually does (you could easily extract body of the loop etc). Also, avoid magic constants, where does the seven in: while (r - l > 7) { come from?



Edit: I've creted one more benchmark for this, this time as proposed by OP, that is with huge array size (actually a vector) and made the access to the elements of the array random, see here- this time with gcc fancy version is as good as basic. The lesson should probably be: compiler is usually at least as smart as you are :).






share|improve this answer










New contributor




paler123 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.






$endgroup$













  • $begingroup$
    You are using a small array that fits into the L2 cache (the CPU type is unknown). You can try to extend the size of the array to 1M to see the fancy code to win. However, it is still quite a small array and L2 cache misses do not occur significantly often. The number 7 is not important for the performance, it can be any small number lager than 3. I selected 7 since it approximately corresponds to one 64 cache line.
    $endgroup$
    – Radim Bača
    1 hour ago






  • 1




    $begingroup$
    Not necessarily, gcc even after increasing the count of elements in the array to 10^6 elements produces better times for basic implementation. What I'm trying to say is that your results will vary depending on your benchmark and if you won't provide reliable way to measure performance, then it's tough to recommend further optimisations.
    $endgroup$
    – paler123
    1 hour ago










  • $begingroup$
    You are right. It is true that I did not specify the benchmark clearly here. I'll update the question.
    $endgroup$
    – Radim Bača
    1 hour ago












  • $begingroup$
    I have updated the question. I hope it is more clear now. Thank you for your recommendations.
    $endgroup$
    – Radim Bača
    49 mins ago










  • $begingroup$
    Concerning your benchmark - it is not large enough and also the queries into the array are not random.
    $endgroup$
    – Radim Bača
    42 mins ago











Your Answer





StackExchange.ifUsing("editor", function () {
return StackExchange.using("mathjaxEditing", function () {
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
});
});
}, "mathjax-editing");

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "196"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});


}
});






Radim Bača is a new contributor. Be nice, and check out our Code of Conduct.










draft saved

draft discarded


















StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f212088%2fbinary-search-for-students%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown

























1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes









4












$begingroup$

How did you measure the performance of your code (test code, hardware, compiler, etc.)? Without that it is impossible to get into optimization. Depending on chosen compiler I was sort of able to reproduce your results, see this quick bench- for clang 7.0.0 the fancy algorithm is indeed faster. That being said with gcc the basic one is a lot faster (outperforming fancy algorithm in both clang and gcc).

With that in mind, you can optimize your algorithm for different array sizes quite easily (benchmarking on your specific setup), by choosing fastest of the 3 for specific sizes.



As far as the code is concerned, there are a couple of things you can improve.



You could rewrite your algorithms so that they accept iterators, just like all the algorithms from standard library, that will drastically improve their flexibility without added effort (and most probably without degrading performance, but you should measure it anyway).
Apart from that, you should avoid implicit casts, e.g. here:



int r = item_count - 1;


You should also prefer to keep the bodies of your functions short, it's hard to understand what binarySearch_duo actually does (you could easily extract body of the loop etc). Also, avoid magic constants, where does the seven in: while (r - l > 7) { come from?



Edit: I've creted one more benchmark for this, this time as proposed by OP, that is with huge array size (actually a vector) and made the access to the elements of the array random, see here- this time with gcc fancy version is as good as basic. The lesson should probably be: compiler is usually at least as smart as you are :).






share|improve this answer










New contributor




paler123 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.






$endgroup$













  • $begingroup$
    You are using a small array that fits into the L2 cache (the CPU type is unknown). You can try to extend the size of the array to 1M to see the fancy code to win. However, it is still quite a small array and L2 cache misses do not occur significantly often. The number 7 is not important for the performance, it can be any small number lager than 3. I selected 7 since it approximately corresponds to one 64 cache line.
    $endgroup$
    – Radim Bača
    1 hour ago






  • 1




    $begingroup$
    Not necessarily, gcc even after increasing the count of elements in the array to 10^6 elements produces better times for basic implementation. What I'm trying to say is that your results will vary depending on your benchmark and if you won't provide reliable way to measure performance, then it's tough to recommend further optimisations.
    $endgroup$
    – paler123
    1 hour ago










  • $begingroup$
    You are right. It is true that I did not specify the benchmark clearly here. I'll update the question.
    $endgroup$
    – Radim Bača
    1 hour ago












  • $begingroup$
    I have updated the question. I hope it is more clear now. Thank you for your recommendations.
    $endgroup$
    – Radim Bača
    49 mins ago










  • $begingroup$
    Concerning your benchmark - it is not large enough and also the queries into the array are not random.
    $endgroup$
    – Radim Bača
    42 mins ago
















4












$begingroup$

How did you measure the performance of your code (test code, hardware, compiler, etc.)? Without that it is impossible to get into optimization. Depending on chosen compiler I was sort of able to reproduce your results, see this quick bench- for clang 7.0.0 the fancy algorithm is indeed faster. That being said with gcc the basic one is a lot faster (outperforming fancy algorithm in both clang and gcc).

With that in mind, you can optimize your algorithm for different array sizes quite easily (benchmarking on your specific setup), by choosing fastest of the 3 for specific sizes.



As far as the code is concerned, there are a couple of things you can improve.



You could rewrite your algorithms so that they accept iterators, just like all the algorithms from standard library, that will drastically improve their flexibility without added effort (and most probably without degrading performance, but you should measure it anyway).
Apart from that, you should avoid implicit casts, e.g. here:



int r = item_count - 1;


You should also prefer to keep the bodies of your functions short, it's hard to understand what binarySearch_duo actually does (you could easily extract body of the loop etc). Also, avoid magic constants, where does the seven in: while (r - l > 7) { come from?



Edit: I've creted one more benchmark for this, this time as proposed by OP, that is with huge array size (actually a vector) and made the access to the elements of the array random, see here- this time with gcc fancy version is as good as basic. The lesson should probably be: compiler is usually at least as smart as you are :).






share|improve this answer










New contributor




paler123 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.






$endgroup$













  • $begingroup$
    You are using a small array that fits into the L2 cache (the CPU type is unknown). You can try to extend the size of the array to 1M to see the fancy code to win. However, it is still quite a small array and L2 cache misses do not occur significantly often. The number 7 is not important for the performance, it can be any small number lager than 3. I selected 7 since it approximately corresponds to one 64 cache line.
    $endgroup$
    – Radim Bača
    1 hour ago






  • 1




    $begingroup$
    Not necessarily, gcc even after increasing the count of elements in the array to 10^6 elements produces better times for basic implementation. What I'm trying to say is that your results will vary depending on your benchmark and if you won't provide reliable way to measure performance, then it's tough to recommend further optimisations.
    $endgroup$
    – paler123
    1 hour ago










  • $begingroup$
    You are right. It is true that I did not specify the benchmark clearly here. I'll update the question.
    $endgroup$
    – Radim Bača
    1 hour ago












  • $begingroup$
    I have updated the question. I hope it is more clear now. Thank you for your recommendations.
    $endgroup$
    – Radim Bača
    49 mins ago










  • $begingroup$
    Concerning your benchmark - it is not large enough and also the queries into the array are not random.
    $endgroup$
    – Radim Bača
    42 mins ago














4












4








4





$begingroup$

How did you measure the performance of your code (test code, hardware, compiler, etc.)? Without that it is impossible to get into optimization. Depending on chosen compiler I was sort of able to reproduce your results, see this quick bench- for clang 7.0.0 the fancy algorithm is indeed faster. That being said with gcc the basic one is a lot faster (outperforming fancy algorithm in both clang and gcc).

With that in mind, you can optimize your algorithm for different array sizes quite easily (benchmarking on your specific setup), by choosing fastest of the 3 for specific sizes.



As far as the code is concerned, there are a couple of things you can improve.



You could rewrite your algorithms so that they accept iterators, just like all the algorithms from standard library, that will drastically improve their flexibility without added effort (and most probably without degrading performance, but you should measure it anyway).
Apart from that, you should avoid implicit casts, e.g. here:



int r = item_count - 1;


You should also prefer to keep the bodies of your functions short, it's hard to understand what binarySearch_duo actually does (you could easily extract body of the loop etc). Also, avoid magic constants, where does the seven in: while (r - l > 7) { come from?



Edit: I've creted one more benchmark for this, this time as proposed by OP, that is with huge array size (actually a vector) and made the access to the elements of the array random, see here- this time with gcc fancy version is as good as basic. The lesson should probably be: compiler is usually at least as smart as you are :).






share|improve this answer










New contributor




paler123 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.






$endgroup$



How did you measure the performance of your code (test code, hardware, compiler, etc.)? Without that it is impossible to get into optimization. Depending on chosen compiler I was sort of able to reproduce your results, see this quick bench- for clang 7.0.0 the fancy algorithm is indeed faster. That being said with gcc the basic one is a lot faster (outperforming fancy algorithm in both clang and gcc).

With that in mind, you can optimize your algorithm for different array sizes quite easily (benchmarking on your specific setup), by choosing fastest of the 3 for specific sizes.



As far as the code is concerned, there are a couple of things you can improve.



You could rewrite your algorithms so that they accept iterators, just like all the algorithms from standard library, that will drastically improve their flexibility without added effort (and most probably without degrading performance, but you should measure it anyway).
Apart from that, you should avoid implicit casts, e.g. here:



int r = item_count - 1;


You should also prefer to keep the bodies of your functions short, it's hard to understand what binarySearch_duo actually does (you could easily extract body of the loop etc). Also, avoid magic constants, where does the seven in: while (r - l > 7) { come from?



Edit: I've creted one more benchmark for this, this time as proposed by OP, that is with huge array size (actually a vector) and made the access to the elements of the array random, see here- this time with gcc fancy version is as good as basic. The lesson should probably be: compiler is usually at least as smart as you are :).







share|improve this answer










New contributor




paler123 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.









share|improve this answer



share|improve this answer








edited 23 mins ago





















New contributor




paler123 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.









answered 2 hours ago









paler123paler123

1614




1614




New contributor




paler123 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.





New contributor





paler123 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.






paler123 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.












  • $begingroup$
    You are using a small array that fits into the L2 cache (the CPU type is unknown). You can try to extend the size of the array to 1M to see the fancy code to win. However, it is still quite a small array and L2 cache misses do not occur significantly often. The number 7 is not important for the performance, it can be any small number lager than 3. I selected 7 since it approximately corresponds to one 64 cache line.
    $endgroup$
    – Radim Bača
    1 hour ago






  • 1




    $begingroup$
    Not necessarily, gcc even after increasing the count of elements in the array to 10^6 elements produces better times for basic implementation. What I'm trying to say is that your results will vary depending on your benchmark and if you won't provide reliable way to measure performance, then it's tough to recommend further optimisations.
    $endgroup$
    – paler123
    1 hour ago










  • $begingroup$
    You are right. It is true that I did not specify the benchmark clearly here. I'll update the question.
    $endgroup$
    – Radim Bača
    1 hour ago












  • $begingroup$
    I have updated the question. I hope it is more clear now. Thank you for your recommendations.
    $endgroup$
    – Radim Bača
    49 mins ago










  • $begingroup$
    Concerning your benchmark - it is not large enough and also the queries into the array are not random.
    $endgroup$
    – Radim Bača
    42 mins ago


















  • $begingroup$
    You are using a small array that fits into the L2 cache (the CPU type is unknown). You can try to extend the size of the array to 1M to see the fancy code to win. However, it is still quite a small array and L2 cache misses do not occur significantly often. The number 7 is not important for the performance, it can be any small number lager than 3. I selected 7 since it approximately corresponds to one 64 cache line.
    $endgroup$
    – Radim Bača
    1 hour ago






  • 1




    $begingroup$
    Not necessarily, gcc even after increasing the count of elements in the array to 10^6 elements produces better times for basic implementation. What I'm trying to say is that your results will vary depending on your benchmark and if you won't provide reliable way to measure performance, then it's tough to recommend further optimisations.
    $endgroup$
    – paler123
    1 hour ago










  • $begingroup$
    You are right. It is true that I did not specify the benchmark clearly here. I'll update the question.
    $endgroup$
    – Radim Bača
    1 hour ago












  • $begingroup$
    I have updated the question. I hope it is more clear now. Thank you for your recommendations.
    $endgroup$
    – Radim Bača
    49 mins ago










  • $begingroup$
    Concerning your benchmark - it is not large enough and also the queries into the array are not random.
    $endgroup$
    – Radim Bača
    42 mins ago
















$begingroup$
You are using a small array that fits into the L2 cache (the CPU type is unknown). You can try to extend the size of the array to 1M to see the fancy code to win. However, it is still quite a small array and L2 cache misses do not occur significantly often. The number 7 is not important for the performance, it can be any small number lager than 3. I selected 7 since it approximately corresponds to one 64 cache line.
$endgroup$
– Radim Bača
1 hour ago




$begingroup$
You are using a small array that fits into the L2 cache (the CPU type is unknown). You can try to extend the size of the array to 1M to see the fancy code to win. However, it is still quite a small array and L2 cache misses do not occur significantly often. The number 7 is not important for the performance, it can be any small number lager than 3. I selected 7 since it approximately corresponds to one 64 cache line.
$endgroup$
– Radim Bača
1 hour ago




1




1




$begingroup$
Not necessarily, gcc even after increasing the count of elements in the array to 10^6 elements produces better times for basic implementation. What I'm trying to say is that your results will vary depending on your benchmark and if you won't provide reliable way to measure performance, then it's tough to recommend further optimisations.
$endgroup$
– paler123
1 hour ago




$begingroup$
Not necessarily, gcc even after increasing the count of elements in the array to 10^6 elements produces better times for basic implementation. What I'm trying to say is that your results will vary depending on your benchmark and if you won't provide reliable way to measure performance, then it's tough to recommend further optimisations.
$endgroup$
– paler123
1 hour ago












$begingroup$
You are right. It is true that I did not specify the benchmark clearly here. I'll update the question.
$endgroup$
– Radim Bača
1 hour ago






$begingroup$
You are right. It is true that I did not specify the benchmark clearly here. I'll update the question.
$endgroup$
– Radim Bača
1 hour ago














$begingroup$
I have updated the question. I hope it is more clear now. Thank you for your recommendations.
$endgroup$
– Radim Bača
49 mins ago




$begingroup$
I have updated the question. I hope it is more clear now. Thank you for your recommendations.
$endgroup$
– Radim Bača
49 mins ago












$begingroup$
Concerning your benchmark - it is not large enough and also the queries into the array are not random.
$endgroup$
– Radim Bača
42 mins ago




$begingroup$
Concerning your benchmark - it is not large enough and also the queries into the array are not random.
$endgroup$
– Radim Bača
42 mins ago










Radim Bača is a new contributor. Be nice, and check out our Code of Conduct.










draft saved

draft discarded


















Radim Bača is a new contributor. Be nice, and check out our Code of Conduct.













Radim Bača is a new contributor. Be nice, and check out our Code of Conduct.












Radim Bača is a new contributor. Be nice, and check out our Code of Conduct.
















Thanks for contributing an answer to Code Review Stack Exchange!


  • Please be sure to answer the question. Provide details and share your research!

But avoid



  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.


Use MathJax to format equations. MathJax reference.


To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f212088%2fbinary-search-for-students%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







Popular posts from this blog

الفوسفات في المغرب

Four equal circles intersect: What is the area of the small shaded portion and its height

بطل الاتحاد السوفيتي