2012-09-15

Comparing floats in bash

This method is by far the most readable bash circumlocution for floating point comparison I've seen:

if (( $( bc <<< "$X < $Y" ) )); then
 echo "summat"
else
 echo "nowt"
fi

First, unlike methods seen elsewhere, the comparison string, "$X < $Y", is placed on bc's standard input using bash's “here string” operator <<< rather than echo in a pipeline or using a temporary variable.

The output of bc is either 0 (true) or 1 (false). This is not an exit status, but the silent arithmetic operator construct ((…)) returns true when the result of the contained operation is zero, false otherwise. This means our $( bc… ) operation can be wrapped in the arithmetic operator: (( $( bc… ) )).

While awk too can return an exit status, the comparison must be brought about with an if-else clause of its own, in addition to that used in bash.

Sure, it would be simpler if bash just implemented floats the way ksh and zsh do, but then, my drill doesn't make coffee. It's a shell, and why am I comparing floats in the first place?

In fact, the reason I wrote the above was to compare NVIDIA driver versions. These version strings only resemble floating point numbers about half the time. While I was dealing with “304.43” upgrading from “304.37”, I’d completely forgotten about “285.05.09” and “270.41.19”.

Note that versioning schemes for other software, even if they use a single dot, don’t necessarily increment consistent with arithmetical rules: 123.45 should be read “one hundred and twenty three dot forty five”, which is prior to 123.104, “one hundred and twenty three dot one hundred and four”. For these cases, we need to operate on the minor part as a ranged-right or zero-padded string. The following is about as general as can be achieved. Assuming no field is greater than 9999, it converts up to four fields of a dot-delimited version string into a sixteen-digit version number:

VER_NUM=$(awk -F. \
  '{printf "%04d%04d%04d%04d",$1,$2,$3,$4}'<<<$VER_STRING)

These may then be compared directly in bash:

if [ $VER_NUM1 -lt $VER_NUM2 ]; then … … … 

Now, while this works, it fails to optimize for the fact this the strings are now identical in length, and may be compared lexicographically if we switch to the double square bracket:

if [[ $VER_NUM1 < $VER_NUM2 ]]; then … … … 

This is becoming more readable, but we’re using awk twice now, compared to using bc once in the floating point version! We can do this kind of thing entirely internal to bash using parameter expansion rules to replace “.” with a space, and the internal printf function to zero-pad each “word”. We'll use it more than once, so put it in a function:

function ver_s2n() {
 printf "%04d%04d%04d%04d\n" ${1//./ }
}

if [[ $(ver_s2n $X) < $(ver_s2n $Y) ]]; then … … … 

Okay, so now it’s fully internal to bash, but it’s still not a one-liner. Combined with the lexicographical comparison, that’s about as far as we can get with general version number comparisons. But what about NVIDIA? It turns out that the version string's second and third fields are always zero-padded to two digits, so they are decimals. In fact, because of the fixed field lengths, we could eliminate the dots using bash parameter expansion rules:

if [ ${X//.} -lt ${Y//.} ]; then … … …

But even better, if we stop thinking in terms of numbers, the version strings including the dots stand up to lexicographical comparison:

if [[ $X < $Y ]]; then … … …

Coming back around to true floating point decimals if we arbitrarily zero-pad the variables, then we can compare using the lexicographical operator.

X1=$(printf "%010f" $X)
Y1=$(printf "%010f" $Y)
if [[ $X1 < $Y1 ]]; then … … …

If such true floats are obtained through some calculation or extracted from a file by an external command, the printf could be added to the pipeline.

The lessons learnt here are applicable to other languages:

  • when you see a floating point number, do not assume it is floating point
  • when you see a number, try operating on it as a string instead
  • look for structure in data which may suggest optimizations

No comments:

Post a Comment