<HTML>

<HEAD>
<TITLE>Berkeley SoftFloat Library Interface</TITLE>
</HEAD>

<BODY>

<H1>Berkeley SoftFloat Release 3a: Library Interface</H1>

<P>
John R. Hauser<BR>
2015 October 23<BR>
</P>


<H2>Contents</H2>

<BLOCKQUOTE>
<TABLE BORDER=0 CELLSPACING=0 CELLPADDING=0>
<COL WIDTH=25>
<COL WIDTH=*>
<TR><TD COLSPAN=2>1. Introduction</TD></TR>
<TR><TD COLSPAN=2>2. Limitations</TD></TR>
<TR><TD COLSPAN=2>3. Acknowledgments and License</TD></TR>
<TR><TD COLSPAN=2>4. Types and Functions</TD></TR>
<TR><TD></TD><TD>4.1. Boolean and Integer Types</TD></TR>
<TR><TD></TD><TD>4.2. Floating-Point Types</TD></TR>
<TR><TD></TD><TD>4.3. Supported Floating-Point Functions</TD></TR>
<TR>
    <TD></TD>
    <TD>4.4. Non-canonical Representations in <CODE>extFloat80_t</CODE></TD>
</TR>
<TR><TD></TD><TD>4.5. Conventions for Passing Arguments and Results</TD></TR>
<TR><TD COLSPAN=2>5. Reserved Names</TD></TR>
<TR><TD COLSPAN=2>6. Mode Variables</TD></TR>
<TR><TD></TD><TD>6.1.
Rounding Mode</TD></TR>
<TR><TD></TD><TD>6.2. Underflow Detection</TD></TR>
<TR>
    <TD></TD>
    <TD>6.3. Rounding Precision for the <NOBR>80-Bit</NOBR> Extended Format</TD>
</TR>
<TR><TD COLSPAN=2>7. Exceptions and Exception Flags</TD></TR>
<TR><TD COLSPAN=2>8. Function Details</TD></TR>
<TR><TD></TD><TD>8.1. Conversions from Integer to Floating-Point</TD></TR>
<TR><TD></TD><TD>8.2. Conversions from Floating-Point to Integer</TD></TR>
<TR><TD></TD><TD>8.3. Conversions Among Floating-Point Types</TD></TR>
<TR><TD></TD><TD>8.4. Basic Arithmetic Functions</TD></TR>
<TR><TD></TD><TD>8.5. Fused Multiply-Add Functions</TD></TR>
<TR><TD></TD><TD>8.6. Remainder Functions</TD></TR>
<TR><TD></TD><TD>8.7. Round-to-Integer Functions</TD></TR>
<TR><TD></TD><TD>8.8. Comparison Functions</TD></TR>
<TR><TD></TD><TD>8.9. Signaling NaN Test Functions</TD></TR>
<TR><TD></TD><TD>8.10. Raise-Exception Function</TD></TR>
<TR><TD COLSPAN=2>9. Changes from SoftFloat <NOBR>Release 2</NOBR></TD></TR>
<TR><TD></TD><TD>9.1. Name Changes</TD></TR>
<TR><TD></TD><TD>9.2. Changes to Function Arguments</TD></TR>
<TR><TD></TD><TD>9.3. Added Capabilities</TD></TR>
<TR><TD></TD><TD>9.4. Better Compatibility with the C Language</TD></TR>
<TR><TD></TD><TD>9.5. New Organization as a Library</TD></TR>
<TR><TD></TD><TD>9.6.
Optimization Gains (and Losses)</TD></TR>
<TR><TD COLSPAN=2>10. Future Directions</TD></TR>
<TR><TD COLSPAN=2>11. Contact Information</TD></TR>
</TABLE>
</BLOCKQUOTE>


<H2>1. Introduction</H2>

<P>
Berkeley SoftFloat is a software implementation of binary floating-point that
conforms to the IEEE Standard for Floating-Point Arithmetic.
The current release supports four binary formats: <NOBR>32-bit</NOBR>
single-precision, <NOBR>64-bit</NOBR> double-precision, <NOBR>80-bit</NOBR>
double-extended-precision, and <NOBR>128-bit</NOBR> quadruple-precision.
The following functions are supported for each format:
<UL>
<LI>
addition, subtraction, multiplication, division, and square root;
<LI>
fused multiply-add as defined by the IEEE Standard, except for
<NOBR>80-bit</NOBR> double-extended-precision;
<LI>
remainder as defined by the IEEE Standard;
<LI>
round to integral value;
<LI>
comparisons;
<LI>
conversions to/from other supported formats; and
<LI>
conversions to/from <NOBR>32-bit</NOBR> and <NOBR>64-bit</NOBR> integers,
signed and unsigned.
</UL>
All operations required by the original 1985 version of the IEEE Floating-Point
Standard are implemented, except for conversions to and from decimal.
</P>

<P>
This document gives information about the types defined and the routines
implemented by SoftFloat.
It does not attempt to define or explain the IEEE Floating-Point Standard.
Information about the standard is available elsewhere.
</P>

<P>
The current version of SoftFloat is <NOBR>Release 3a</NOBR>.
The only difference between this version and the previous
<NOBR>Release 3</NOBR> is the replacement of the license text supplied by the
University of California.
</P>

<P>
The functional interface of SoftFloat <NOBR>Release 3</NOBR> and afterward
differs in many details from that of earlier releases.
For specifics of these differences, see <NOBR>section 9</NOBR> below,
<I>Changes from SoftFloat <NOBR>Release 2</NOBR></I>.
</P>


<H2>2. Limitations</H2>

<P>
SoftFloat assumes the computer has an addressable byte size of 8 or
<NOBR>16 bits</NOBR>.
(Nearly all computers in use today have <NOBR>8-bit</NOBR> bytes.)
</P>

<P>
SoftFloat is written in C and is designed to work with other C code.
The C compiler used must conform at a minimum to the 1989 ANSI standard for the
C language (same as the 1990 ISO standard) and must in addition support basic
arithmetic on <NOBR>64-bit</NOBR> integers.
Earlier releases of SoftFloat included implementations of <NOBR>32-bit</NOBR>
single-precision and <NOBR>64-bit</NOBR> double-precision floating-point that
did not require <NOBR>64-bit</NOBR> integers, but this option is not supported
starting with <NOBR>Release 3</NOBR>.
Since 1999, ISO standards for C have mandated compiler support for
<NOBR>64-bit</NOBR> integers.
A compiler conforming to the 1999 C Standard or later is recommended but not
strictly required.
</P>

<P>
Most operations not required by the original 1985 version of the IEEE
Floating-Point Standard but added in the 2008 version are not yet supported in
SoftFloat <NOBR>Release 3a</NOBR>.
</P>


<H2>3. Acknowledgments and License</H2>

<P>
The SoftFloat package was written by me, <NOBR>John R.</NOBR> Hauser.
<NOBR>Release 3</NOBR> of SoftFloat was a completely new implementation
supplanting earlier releases.
The project to create <NOBR>Release 3</NOBR> (and <NOBR>now 3a</NOBR>) was done
in the employ of the University of California, Berkeley, within the Department
of Electrical Engineering and Computer Sciences, first for the Parallel
Computing Laboratory (Par Lab) and then for the ASPIRE Lab.
The work was officially overseen by Prof. Krste Asanovic, with funding provided
by these sources:
<BLOCKQUOTE>
<TABLE>
<COL>
<COL WIDTH=10>
<COL>
<TR>
<TD VALIGN=TOP><NOBR>Par Lab:</NOBR></TD>
<TD></TD>
<TD>
Microsoft (Award #024263), Intel (Award #024894), and U.C. Discovery
(Award #DIG07-10227), with additional support from Par Lab affiliates Nokia,
NVIDIA, Oracle, and Samsung.
</TD>
</TR>
<TR>
<TD VALIGN=TOP><NOBR>ASPIRE Lab:</NOBR></TD>
<TD></TD>
<TD>
DARPA PERFECT program (Award #HR0011-12-2-0016), with additional support from
ASPIRE industrial sponsor Intel and ASPIRE affiliates Google, Nokia, NVIDIA,
Oracle, and Samsung.
</TD>
</TR>
</TABLE>
</BLOCKQUOTE>
</P>

<P>
The following applies to the whole of SoftFloat <NOBR>Release 3a</NOBR> as well
as to each source file individually.
</P>

<P>
Copyright 2011, 2012, 2013, 2014, 2015 The Regents of the University of
California.
All rights reserved.
</P>

<P>
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
<OL>

<LI>
<P>
Redistributions of source code must retain the above copyright notice, this
list of conditions, and the following disclaimer.
</P>

<LI>
<P>
Redistributions in binary form must reproduce the above copyright notice, this
list of conditions, and the following disclaimer in the documentation and/or
other materials provided with the distribution.
</P>

<LI>
<P>
Neither the name of the University nor the names of its contributors may be
used to endorse or promote products derived from this software without specific
prior written permission.
</P>

</OL>
</P>

<P>
THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS “AS IS”,
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE, ARE
DISCLAIMED.
IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE
OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
</P>


<H2>4. Types and Functions</H2>

<P>
The types and functions of SoftFloat are declared in header file
<CODE>softfloat.h</CODE>.
</P>

<H3>4.1. Boolean and Integer Types</H3>

<P>
Header file <CODE>softfloat.h</CODE> depends on standard headers
<CODE>&lt;stdbool.h&gt;</CODE> and <CODE>&lt;stdint.h&gt;</CODE> to define type
<CODE>bool</CODE> and several integer types.
These standard headers have been part of the ISO C Standard Library since 1999.
With any recent compiler, they are likely to be supported, even if the compiler
does not claim complete conformance to the ISO C Standard.
For older or nonstandard compilers, a port of SoftFloat may have substitutes
for these headers.
Header <CODE>softfloat.h</CODE> depends only on the name <CODE>bool</CODE> from
<CODE>&lt;stdbool.h&gt;</CODE> and on these type names from
<CODE>&lt;stdint.h&gt;</CODE>:
<BLOCKQUOTE>
<PRE>
uint16_t
uint32_t
uint64_t
int32_t
int64_t
uint_fast8_t
uint_fast32_t
uint_fast64_t
</PRE>
</BLOCKQUOTE>
</P>


<H3>4.2.
Floating-Point Types</H3>

<P>
The <CODE>softfloat.h</CODE> header defines four floating-point types:
<BLOCKQUOTE>
<TABLE CELLSPACING=0 CELLPADDING=0>
<TR>
<TD><CODE>float32_t</CODE></TD>
<TD><NOBR>32-bit</NOBR> single-precision binary format</TD>
</TR>
<TR>
<TD><CODE>float64_t</CODE></TD>
<TD><NOBR>64-bit</NOBR> double-precision binary format</TD>
</TR>
<TR>
<TD><CODE>extFloat80_t </CODE></TD>
<TD><NOBR>80-bit</NOBR> double-extended-precision binary format (old Intel or
Motorola format)</TD>
</TR>
<TR>
<TD><CODE>float128_t</CODE></TD>
<TD><NOBR>128-bit</NOBR> quadruple-precision binary format</TD>
</TR>
</TABLE>
</BLOCKQUOTE>
The non-extended types are each exactly the size specified:
<NOBR>32 bits</NOBR> for <CODE>float32_t</CODE>, <NOBR>64 bits</NOBR> for
<CODE>float64_t</CODE>, and <NOBR>128 bits</NOBR> for <CODE>float128_t</CODE>.
Aside from these size requirements, the definitions of all these types may
differ for different ports of SoftFloat to specific systems.
A given port of SoftFloat may or may not define some of the floating-point
types as aliases for the C standard types <CODE>float</CODE>,
<CODE>double</CODE>, and <CODE>long</CODE> <CODE>double</CODE>.
</P>

<P>
Header file <CODE>softfloat.h</CODE> also defines a structure,
<CODE>struct</CODE> <CODE>extFloat80M</CODE>, for the representation of
<NOBR>80-bit</NOBR> double-extended-precision floating-point values in memory.
This structure is the same size as type <CODE>extFloat80_t</CODE> and contains
at least these two fields (not necessarily in this order):
<BLOCKQUOTE>
<PRE>
uint16_t signExp;
uint64_t signif;
</PRE>
</BLOCKQUOTE>
Field <CODE>signExp</CODE> contains the sign and exponent of the floating-point
value, with the sign in the most significant bit (<NOBR>bit 15</NOBR>) and the
encoded exponent in the other <NOBR>15 bits</NOBR>.
Field <CODE>signif</CODE> is the complete <NOBR>64-bit</NOBR> significand of
the floating-point value.
(In the usual encoding for <NOBR>80-bit</NOBR> extended floating-point, the
leading <NOBR>1 bit</NOBR> of normalized numbers is not implicit but is stored
in the most significant bit of the significand.)
</P>

<H3>4.3.
Supported Floating-Point Functions</H3>

<P>
SoftFloat implements these arithmetic operations for its floating-point types:
<UL>
<LI>
conversions between any two floating-point formats;
<LI>
for each floating-point format, conversions to and from signed and unsigned
<NOBR>32-bit</NOBR> and <NOBR>64-bit</NOBR> integers;
<LI>
for each format, the usual addition, subtraction, multiplication, division, and
square root operations;
<LI>
for each format except <CODE>extFloat80_t</CODE>, the fused multiply-add
operation defined by the IEEE Standard;
<LI>
for each format, the floating-point remainder operation defined by the IEEE
Standard;
<LI>
for each format, a “round to integer” operation that rounds to the
nearest integer value in the same format; and
<LI>
comparisons between two values in the same floating-point format.
</UL>
</P>

<P>
The following operations required by the 2008 IEEE Floating-Point Standard are
not supported in SoftFloat <NOBR>Release 3a</NOBR>:
<UL>
<LI>
<B>nextUp</B>, <B>nextDown</B>, <B>minNum</B>, <B>maxNum</B>, <B>minNumMag</B>,
<B>maxNumMag</B>, <B>scaleB</B>, and <B>logB</B>;
<LI>
conversions between floating-point formats and decimal or hexadecimal character
sequences;
<LI>
all “quiet-computation” operations (<B>copy</B>, <B>negate</B>,
<B>abs</B>, and <B>copySign</B>, which all involve only simple copying and/or
manipulation of the floating-point sign bit); and
<LI>
all “non-computational” operations other than <B>isSignaling</B>
(which is supported).
</UL>
</P>

<H3>4.4. Non-canonical Representations in <CODE>extFloat80_t</CODE></H3>

<P>
Because the <NOBR>80-bit</NOBR> double-extended-precision format,
<CODE>extFloat80_t</CODE>, stores an explicit leading significand bit, many
floating-point numbers are encodable in this type in equivalent normalized and
denormalized forms.
Zeros and values in the subnormal range each have only a single possible
encoding, for which the leading significand bit must <NOBR>be 0</NOBR>.
For other finite values (outside the subnormal range), a unique normalized
representation, with leading significand bit set <NOBR>to 1</NOBR>, always
exists, and is considered the <I>canonical</I> representation of the value.
Any equivalent denormalized representations (having leading significand bit
<NOBR>of 0</NOBR>) are <I>non-canonical</I>.
Similarly, the leading significand bit is expected to <NOBR>be 1</NOBR> for
infinities and NaNs as well;
any infinity or NaN with a leading significand bit <NOBR>of 0</NOBR> is again
considered non-canonical.
In short, for an <CODE>extFloat80_t</CODE> representation to be canonical, the
leading significand bit must <NOBR>be 1</NOBR> unless it is required to
<NOBR>be 0</NOBR> because the encoded value is zero or a subnormal.
</P>

<P>
Functions are not guaranteed to operate as expected when inputs of type
<CODE>extFloat80_t</CODE> are non-canonical.
Assuming all of a function’s <CODE>extFloat80_t</CODE> inputs (if any)
are canonical, function outputs of type <CODE>extFloat80_t</CODE> will always
be canonical.
</P>

<H3>4.5.
Conventions for Passing Arguments and Results</H3>

<P>
Values that are at most <NOBR>64 bits</NOBR> in size (i.e., not the
<NOBR>80-bit</NOBR> or <NOBR>128-bit</NOBR> floating-point formats) are in all
cases passed as function arguments by value.
Likewise, when an output of a function is no more than <NOBR>64 bits</NOBR>, it
is always returned directly as the function result.
Thus, for example, the SoftFloat function for adding two <NOBR>64-bit</NOBR>
floating-point values has this simple signature:
<BLOCKQUOTE>
<CODE>float64_t f64_add( float64_t, float64_t );</CODE>
</BLOCKQUOTE>
</P>

<P>
The story is more complex when function inputs and outputs are
<NOBR>80-bit</NOBR> and <NOBR>128-bit</NOBR> floating-point.
For these types, SoftFloat always provides a function that passes these larger
values into or out of the function indirectly, via pointers.
For example, for adding two <NOBR>128-bit</NOBR> floating-point values,
SoftFloat supplies this function:
<BLOCKQUOTE>
<CODE>void f128M_add( const float128_t *, const float128_t *, float128_t * );</CODE>
</BLOCKQUOTE>
The first two arguments point to the values to be added, and the last argument
points to the location where the sum will be stored.
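A minimal sketch of this calling pattern follows.
It is not SoftFloat code: the type <CODE>demo128_t</CODE> and the functions
<CODE>demo128M_add</CODE> and <CODE>demo128_sum</CODE> are hypothetical
stand-ins, and the body performs plain <NOBR>128-bit</NOBR> integer addition
only so that the example compiles without the library.
What carries over is the shape of the call: the caller owns the storage for
both inputs and for the result, and passes three addresses.

```c
#include <stdint.h>

/* Hypothetical 128-bit stand-in type; this is NOT SoftFloat's float128_t. */
typedef struct { uint64_t lo, hi; } demo128_t;

/*
 * Same argument shape as f128M_add: two pointers to the inputs and one
 * pointer to the location that receives the result.  The arithmetic is
 * ordinary 128-bit integer addition, used purely as a placeholder.
 */
void demo128M_add(const demo128_t *a, const demo128_t *b, demo128_t *z)
{
    uint64_t lo = a->lo + b->lo;      /* low half, may wrap around */
    uint64_t carry = (lo < a->lo);    /* 1 if the low half wrapped */

    z->hi = a->hi + b->hi + carry;
    z->lo = lo;
}

/* Caller side: allocate the result object, then pass all three addresses. */
demo128_t demo128_sum(demo128_t x, demo128_t y)
{
    demo128_t z;

    demo128M_add(&x, &y, &z);
    return z;
}
```

With the real library, the same pattern would apply to <CODE>f128M_add</CODE>
with <CODE>float128_t</CODE> objects in place of the stand-ins.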
The <CODE>M</CODE> in the name <CODE>f128M_add</CODE> is mnemonic for the fact
that the <NOBR>128-bit</NOBR> inputs and outputs are “in memory”,
pointed to by pointer arguments.
</P>

<P>
All ports of SoftFloat implement these <I>pass-by-pointer</I> functions for
types <CODE>extFloat80_t</CODE> and <CODE>float128_t</CODE>.
At the same time, SoftFloat ports may also implement alternate versions of
these same functions that pass <CODE>extFloat80_t</CODE> and
<CODE>float128_t</CODE> by value, like the smaller formats.
Thus, besides the function with name <CODE>f128M_add</CODE> shown above, a
SoftFloat port may also supply an equivalent function with this signature:
<BLOCKQUOTE>
<CODE>float128_t f128_add( float128_t, float128_t );</CODE>
</BLOCKQUOTE>
</P>

<P>
As a general rule, on computers where the machine word size is
<NOBR>32 bits</NOBR> or smaller, only the pass-by-pointer versions of functions
(e.g., <CODE>f128M_add</CODE>) are provided for types <CODE>extFloat80_t</CODE>
and <CODE>float128_t</CODE>, because passing such large types directly can have
significant extra cost.
On computers where the word size is <NOBR>64 bits</NOBR> or larger, both
function versions (<CODE>f128M_add</CODE> and <CODE>f128_add</CODE>) are
provided, because the cost of passing by value is then more reasonable.
Applications that must be portable across both classes of computers must use
the pointer-based functions, as these are always implemented.
However, if it is known that SoftFloat includes the by-value functions for all
platforms of interest, programmers can use whichever version they prefer.
</P>


<H2>5. Reserved Names</H2>

<P>
In addition to the variables and functions documented here, SoftFloat defines
some symbol names for its own private use.
These private names always begin with the prefix
‘<CODE>softfloat_</CODE>’.
When a program includes header <CODE>softfloat.h</CODE> or links with the
SoftFloat library, all names with prefix ‘<CODE>softfloat_</CODE>’
are reserved for possible use by SoftFloat.
Applications that use SoftFloat should not define their own names with this
prefix, and should reference only such names as are documented.
</P>


<H2>6.
Mode Variables</H2>

<P>
The following variables control rounding mode, underflow detection, and the
<NOBR>80-bit</NOBR> extended format’s rounding precision:
<BLOCKQUOTE>
<CODE>softfloat_roundingMode</CODE><BR>
<CODE>softfloat_detectTininess</CODE><BR>
<CODE>extF80_roundingPrecision</CODE>
</BLOCKQUOTE>
These mode variables are covered in the next several subsections.
</P>

<H3>6.1. Rounding Mode</H3>

<P>
All five rounding modes defined by the 2008 IEEE Floating-Point Standard are
implemented for all operations that require rounding.
The rounding mode is selected by the global variable
<BLOCKQUOTE>
<CODE>uint_fast8_t softfloat_roundingMode;</CODE>
</BLOCKQUOTE>
This variable may be set to one of the values
<BLOCKQUOTE>
<TABLE CELLSPACING=0 CELLPADDING=0>
<TR>
<TD><CODE>softfloat_round_near_even</CODE></TD>
<TD>round to nearest, with ties to even</TD>
</TR>
<TR>
<TD><CODE>softfloat_round_near_maxMag </CODE></TD>
<TD>round to nearest, with ties to maximum magnitude (away from zero)</TD>
</TR>
<TR>
<TD><CODE>softfloat_round_minMag</CODE></TD>
<TD>round to minimum magnitude (toward zero)</TD>
</TR>
<TR>
<TD><CODE>softfloat_round_min</CODE></TD>
<TD>round to minimum (down)</TD>
</TR>
<TR>
<TD><CODE>softfloat_round_max</CODE></TD>
<TD>round to maximum (up)</TD>
</TR>
</TABLE>
</BLOCKQUOTE>
Variable <CODE>softfloat_roundingMode</CODE> is initialized to
<CODE>softfloat_round_near_even</CODE>.
</P>

<H3>6.2. Underflow Detection</H3>

<P>
In the terminology of the IEEE Standard, SoftFloat can detect tininess for
underflow either before or after rounding.
The choice is made by the global variable
<BLOCKQUOTE>
<CODE>uint_fast8_t softfloat_detectTininess;</CODE>
</BLOCKQUOTE>
which can be set to either
<BLOCKQUOTE>
<CODE>softfloat_tininess_beforeRounding</CODE><BR>
<CODE>softfloat_tininess_afterRounding</CODE>
</BLOCKQUOTE>
Detecting tininess after rounding is better because it results in fewer
spurious underflow signals.
The other option is provided for compatibility with some systems.
Like most systems (and as required by the newer 2008 IEEE Standard), SoftFloat
always detects loss of accuracy for underflow as an inexact result.
</P>

<H3>6.3. Rounding Precision for the <NOBR>80-Bit</NOBR> Extended Format</H3>

<P>
For <CODE>extFloat80_t</CODE> only, the rounding precision of the basic
arithmetic operations is controlled by the global variable
<BLOCKQUOTE>
<CODE>uint_fast8_t extF80_roundingPrecision;</CODE>
</BLOCKQUOTE>
The operations affected are:
<BLOCKQUOTE>
<CODE>extF80_add</CODE><BR>
<CODE>extF80_sub</CODE><BR>
<CODE>extF80_mul</CODE><BR>
<CODE>extF80_div</CODE><BR>
<CODE>extF80_sqrt</CODE>
</BLOCKQUOTE>
When <CODE>extF80_roundingPrecision</CODE> is set to its default value of 80,
these operations are rounded to the full precision of the <NOBR>80-bit</NOBR>
double-extended-precision format, as occurs for other formats.
Setting <CODE>extF80_roundingPrecision</CODE> to 32 or to 64 causes the
operations listed to be rounded to <NOBR>32-bit</NOBR> precision (equivalent to
<CODE>float32_t</CODE>) or to <NOBR>64-bit</NOBR> precision (equivalent to
<CODE>float64_t</CODE>), respectively.
When rounding to reduced precision, additional bits in the result significand
beyond the rounding point are set to zero.
The consequences of setting <CODE>extF80_roundingPrecision</CODE> to a value
other than 32, 64, or 80 are not specified.
Operations other than the ones listed above are not affected by
<CODE>extF80_roundingPrecision</CODE>.
</P>


<H2>7. Exceptions and Exception Flags</H2>

<P>
All five exception flags required by the IEEE Floating-Point Standard are
implemented.
Each flag is stored as a separate bit in the global variable
<BLOCKQUOTE>
<CODE>uint_fast8_t softfloat_exceptionFlags;</CODE>
</BLOCKQUOTE>
The positions of the exception flag bits within this variable are determined by
the bit masks
<BLOCKQUOTE>
<CODE>softfloat_flag_inexact</CODE><BR>
<CODE>softfloat_flag_underflow</CODE><BR>
<CODE>softfloat_flag_overflow</CODE><BR>
<CODE>softfloat_flag_infinite</CODE><BR>
<CODE>softfloat_flag_invalid</CODE>
</BLOCKQUOTE>
Variable <CODE>softfloat_exceptionFlags</CODE> is initialized to all zeros,
meaning no exceptions.
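One way a program might use these masks is to test a flag after a sequence of
operations and then clear it so that later operations start from a clean state.
The sketch below shows only that bit-mask pattern.
So that it compiles without the library, it uses stand-in names with the prefix
<CODE>demo_</CODE>; these names and the particular mask values are assumptions
made for illustration, not definitions taken from <CODE>softfloat.h</CODE>.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for the library's bit masks.  The names and the bit values are
   assumptions for this sketch; real code takes them from softfloat.h. */
enum {
    demo_flag_inexact   = 1,
    demo_flag_underflow = 2,
    demo_flag_overflow  = 4,
    demo_flag_infinite  = 8,
    demo_flag_invalid   = 16
};

/* Stand-in for the global flags variable, initialized to no exceptions. */
static uint_fast8_t demo_exceptionFlags = 0;

/* Return true if any flag selected by 'mask' is currently raised. */
static bool demo_anyRaised( uint_fast8_t mask )
{
    return (demo_exceptionFlags & mask) != 0;
}

/* Report whether any flag selected by 'mask' was raised, then clear exactly
   those flags, leaving every other flag untouched. */
static bool demo_testAndClear( uint_fast8_t mask )
{
    bool wasRaised = demo_anyRaised( mask );
    demo_exceptionFlags &= (uint_fast8_t) ~mask;
    return wasRaised;
}
```

With the library itself, the same expressions would be written in terms of
<CODE>softfloat_exceptionFlags</CODE> and the <CODE>softfloat_flag_</CODE>
masks listed above.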
</P>

<P>
An individual exception flag can be cleared with the statement
<BLOCKQUOTE>
<CODE>softfloat_exceptionFlags &= ~softfloat_flag_<<I>exception</I>>;</CODE>
</BLOCKQUOTE>
where <CODE><<I>exception</I>></CODE> is the appropriate name.
To raise a floating-point exception, function <CODE>softfloat_raise</CODE>
should normally be used.
</P>

<P>
When SoftFloat detects an exception other than <I>inexact</I>, it calls
<CODE>softfloat_raise</CODE>.
The default version of this function simply raises the corresponding exception
flags.
Particular ports of SoftFloat may support alternate behavior, such as exception
traps, by modifying the default <CODE>softfloat_raise</CODE>.
A program may also supply its own <CODE>softfloat_raise</CODE> function to
override the one from the SoftFloat library.
</P>

<P>
Because inexact results occur frequently under most circumstances (and thus are
hardly exceptional), SoftFloat does not ordinarily call
<CODE>softfloat_raise</CODE> for <I>inexact</I> exceptions.
It does always raise the <I>inexact</I> exception flag as required.
</P>


<H2>8.
Function Details</H2>

<P>
In this section, <CODE><<I>float</I>></CODE> appears in function names as
a substitute for one of these abbreviations:
<BLOCKQUOTE>
<TABLE CELLSPACING=0 CELLPADDING=0>
<TR>
<TD><CODE>f32</CODE></TD>
<TD>indicates <CODE>float32_t</CODE>, passed by value</TD>
</TR>
<TR>
<TD><CODE>f64</CODE></TD>
<TD>indicates <CODE>float64_t</CODE>, passed by value</TD>
</TR>
<TR>
<TD><CODE>extF80M </CODE></TD>
<TD>indicates <CODE>extFloat80_t</CODE>, passed indirectly via pointers</TD>
</TR>
<TR>
<TD><CODE>extF80</CODE></TD>
<TD>indicates <CODE>extFloat80_t</CODE>, passed by value</TD>
</TR>
<TR>
<TD><CODE>f128M</CODE></TD>
<TD>indicates <CODE>float128_t</CODE>, passed indirectly via pointers</TD>
</TR>
<TR>
<TD><CODE>f128</CODE></TD>
<TD>indicates <CODE>float128_t</CODE>, passed by value</TD>
</TR>
</TABLE>
</BLOCKQUOTE>
The circumstances under which values of floating-point types
<CODE>extFloat80_t</CODE> and <CODE>float128_t</CODE> may be passed either by
value or indirectly via pointers were discussed earlier in
<NOBR>section 4.5</NOBR>, <I>Conventions for Passing Arguments and Results</I>.
</P>

<H3>8.1. Conversions from Integer to Floating-Point</H3>

<P>
All conversions from a <NOBR>32-bit</NOBR> or <NOBR>64-bit</NOBR> integer,
signed or unsigned, to a floating-point format are supported.
Functions performing these conversions have these names:
<BLOCKQUOTE>
<CODE>ui32_to_<<I>float</I>></CODE><BR>
<CODE>ui64_to_<<I>float</I>></CODE><BR>
<CODE>i32_to_<<I>float</I>></CODE><BR>
<CODE>i64_to_<<I>float</I>></CODE>
</BLOCKQUOTE>
Conversions from <NOBR>32-bit</NOBR> integers to <NOBR>64-bit</NOBR>
double-precision and larger formats are always exact, and likewise conversions
from <NOBR>64-bit</NOBR> integers to <NOBR>80-bit</NOBR>
double-extended-precision and <NOBR>128-bit</NOBR> quadruple-precision are also
always exact.
</P>

<P>
Each conversion function takes one input of the appropriate type and generates
one output.
The following illustrates the signatures of these functions in cases when the
floating-point result is passed either by value or via pointers:
<BLOCKQUOTE>
<PRE>
float64_t i32_to_f64( int32_t <I>a</I> );
</PRE>
<PRE>
void i32_to_f128M( int32_t <I>a</I>, float128_t *<I>destPtr</I> );
</PRE>
</BLOCKQUOTE>
</P>

<H3>8.2. Conversions from Floating-Point to Integer</H3>

<P>
Conversions from a floating-point format to a <NOBR>32-bit</NOBR> or
<NOBR>64-bit</NOBR> integer, signed or unsigned, are supported with these
functions:
<BLOCKQUOTE>
<CODE><<I>float</I>>_to_ui32</CODE><BR>
<CODE><<I>float</I>>_to_ui64</CODE><BR>
<CODE><<I>float</I>>_to_i32</CODE><BR>
<CODE><<I>float</I>>_to_i64</CODE>
</BLOCKQUOTE>
The functions have signatures as follows, depending on whether the
floating-point input is passed by value or via pointers:
<BLOCKQUOTE>
<PRE>
int_fast32_t f64_to_i32( float64_t <I>a</I>, uint_fast8_t <I>roundingMode</I>, bool <I>exact</I> );
</PRE>
<PRE>
int_fast32_t
 f128M_to_i32( const float128_t *<I>aPtr</I>, uint_fast8_t <I>roundingMode</I>, bool <I>exact</I> );
</PRE>
</BLOCKQUOTE>
The <CODE><I>roundingMode</I></CODE> argument specifies the rounding mode for
the conversion.
The variable that usually indicates rounding mode,
<CODE>softfloat_roundingMode</CODE>, is ignored.
Argument <CODE><I>exact</I></CODE> determines whether the <I>inexact</I>
exception flag is raised if the conversion is not exact.
If <CODE><I>exact</I></CODE> is <CODE>true</CODE>, the <I>inexact</I> flag may
be raised;
otherwise, it will not be, even if the conversion is inexact.
</P>

<P>
Conversions from floating-point to integer raise the <I>invalid</I> exception
if the source value cannot be rounded to a representable integer of the desired
size (32 or 64 bits).
In such a circumstance, if the floating-point input is a NaN or if the
conversion is to an unsigned integer type, the largest positive integer is
returned;
otherwise, the largest integer with the same sign as the input is returned.
The functions that convert to integer types never raise the <I>overflow</I>
exception.
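The rule for the value returned together with the <I>invalid</I> exception can
be restated as code.
The helper functions below are hypothetical (they are not SoftFloat functions)
and encode only one reading of the rule above for the <NOBR>32-bit</NOBR> case,
taking “largest integer with the same sign” to mean largest in
magnitude.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not part of SoftFloat: the value a conversion to a
   signed 32-bit integer is described as returning when it raises the
   invalid exception. */
static int32_t demo_invalidResult_i32( bool inputIsNaN, bool inputIsNegative )
{
    if ( inputIsNaN ) return INT32_MAX;     /* NaN: largest positive integer */
    /* Otherwise: the largest-magnitude integer with the input's sign. */
    return inputIsNegative ? INT32_MIN : INT32_MAX;
}

/* Hypothetical helper for the unsigned case: the largest positive integer
   is returned whatever the input was, so both arguments are ignored. */
static uint32_t demo_invalidResult_ui32( bool inputIsNaN, bool inputIsNegative )
{
    (void) inputIsNaN;
    (void) inputIsNegative;
    return UINT32_MAX;
}
```

A caller that needs to tell a genuine maximum result apart from this
substitute value has to check the <I>invalid</I> exception flag as well.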
</P>

<P>
Note that, when converting to an unsigned integer type, if the <I>invalid</I>
exception is raised because the input floating-point value would round to a
negative integer, the value returned is the <EM>maximum positive unsigned
integer</EM>.
Zero is not returned when the <I>invalid</I> exception is raised, even when
zero is the closest integer to the original floating-point value.
</P>

<P>
Because languages such <NOBR>as C</NOBR> require that conversions to integers
be rounded toward zero, the following functions are provided for improved speed
and convenience:
<BLOCKQUOTE>
<CODE><<I>float</I>>_to_ui32_r_minMag</CODE><BR>
<CODE><<I>float</I>>_to_ui64_r_minMag</CODE><BR>
<CODE><<I>float</I>>_to_i32_r_minMag</CODE><BR>
<CODE><<I>float</I>>_to_i64_r_minMag</CODE>
</BLOCKQUOTE>
These functions round only toward zero (to minimum magnitude).
The signatures for these functions are the same as above without the redundant
<CODE><I>roundingMode</I></CODE> argument:
<BLOCKQUOTE>
<PRE>
int_fast32_t f64_to_i32_r_minMag( float64_t <I>a</I>, bool <I>exact</I> );
</PRE>
<PRE>
int_fast32_t f128M_to_i32_r_minMag( const float128_t *<I>aPtr</I>, bool <I>exact</I> );
</PRE>
</BLOCKQUOTE>
</P>

<H3>8.3. Conversions Among Floating-Point Types</H3>

<P>
Conversions between floating-point formats are done by functions with these
names:
<BLOCKQUOTE>
<CODE><<I>float</I>>_to_<<I>float</I>></CODE>
</BLOCKQUOTE>
All combinations of source and result type are supported where the source and
result are different formats.
There are four different styles of signature for these functions, depending on
whether the input and the output floating-point values are passed by value or
via pointers:
<BLOCKQUOTE>
<PRE>
float32_t f64_to_f32( float64_t <I>a</I> );
</PRE>
<PRE>
float32_t f128M_to_f32( const float128_t *<I>aPtr</I> );
</PRE>
<PRE>
void f32_to_f128M( float32_t <I>a</I>, float128_t *<I>destPtr</I> );
</PRE>
<PRE>
void extF80M_to_f128M( const extFloat80_t *<I>aPtr</I>, float128_t *<I>destPtr</I> );
</PRE>
</BLOCKQUOTE>
</P>

<P>
Conversions from a smaller to a larger floating-point format are always exact
and so require no rounding.
</P>

<H3>8.4.
Basic Arithmetic Functions</H3>

<P>
The following basic arithmetic functions are provided:
<BLOCKQUOTE>
<CODE><<I>float</I>>_add</CODE><BR>
<CODE><<I>float</I>>_sub</CODE><BR>
<CODE><<I>float</I>>_mul</CODE><BR>
<CODE><<I>float</I>>_div</CODE><BR>
<CODE><<I>float</I>>_sqrt</CODE>
</BLOCKQUOTE>
Each floating-point operation takes two operands, except for <CODE>sqrt</CODE>
(square root) which takes only one.
The operands and result are all of the same floating-point format.
Signatures for these functions take the following forms:
<BLOCKQUOTE>
<PRE>
float64_t f64_add( float64_t <I>a</I>, float64_t <I>b</I> );
</PRE>
<PRE>
void
 f128M_add(
     const float128_t *<I>aPtr</I>, const float128_t *<I>bPtr</I>, float128_t *<I>destPtr</I> );
</PRE>
<PRE>
float64_t f64_sqrt( float64_t <I>a</I> );
</PRE>
<PRE>
void f128M_sqrt( const float128_t *<I>aPtr</I>, float128_t *<I>destPtr</I> );
</PRE>
</BLOCKQUOTE>
When floating-point values are passed indirectly through pointers, arguments
<CODE><I>aPtr</I></CODE> and <CODE><I>bPtr</I></CODE> point to the input
operands, and the last argument,
<CODE><I>destPtr</I></CODE>, points to the
location where the result is stored.
</P>

<P>
Rounding of the <NOBR>80-bit</NOBR> double-extended-precision
(<CODE>extFloat80_t</CODE>) functions is affected by variable
<CODE>extF80_roundingPrecision</CODE>, as explained earlier in
<NOBR>section 6.3</NOBR>,
<I>Rounding Precision for the <NOBR>80-Bit</NOBR> Extended Format</I>.
</P>

<H3>8.5. Fused Multiply-Add Functions</H3>

<P>
The 2008 version of the IEEE Floating-Point Standard defines a <I>fused
multiply-add</I> operation that does a combined multiplication and addition
with only a single rounding.
SoftFloat implements fused multiply-add with functions
<BLOCKQUOTE>
<CODE><<I>float</I>>_mulAdd</CODE>
</BLOCKQUOTE>
Unlike other operations, fused multiply-add is supported only for the
non-extended formats, <CODE>float32_t</CODE>, <CODE>float64_t</CODE>, and
<CODE>float128_t</CODE>.
No fused multiply-add function is currently provided for the
<NOBR>80-bit</NOBR> double-extended-precision type, <CODE>extFloat80_t</CODE>.
</P>

<P>
Depending on whether floating-point values are passed by value or via pointers,
the fused multiply-add functions have signatures of these forms:
<BLOCKQUOTE>
<PRE>
float64_t f64_mulAdd( float64_t <I>a</I>, float64_t <I>b</I>, float64_t <I>c</I> );
</PRE>
<PRE>
void
 f128M_mulAdd(
     const float128_t *<I>aPtr</I>,
     const float128_t *<I>bPtr</I>,
     const float128_t *<I>cPtr</I>,
     float128_t *<I>destPtr</I>
 );
</PRE>
</BLOCKQUOTE>
The functions compute
<NOBR>(<CODE><I>a</I></CODE> × <CODE><I>b</I></CODE>)
 + <CODE><I>c</I></CODE></NOBR>
with a single rounding.
When floating-point values are passed indirectly through pointers, arguments
<CODE><I>aPtr</I></CODE>, <CODE><I>bPtr</I></CODE>, and
<CODE><I>cPtr</I></CODE> point to operands <CODE><I>a</I></CODE>,
<CODE><I>b</I></CODE>, and <CODE><I>c</I></CODE> respectively, and
<CODE><I>destPtr</I></CODE> points to the location where the result is stored.
</P>

<P>
If one of the multiplication operands <CODE><I>a</I></CODE> and
<CODE><I>b</I></CODE> is infinite and the other is zero, these functions raise
the invalid exception even if operand <CODE><I>c</I></CODE> is a quiet NaN.
</P>

<H3>8.6. Remainder Functions</H3>

<P>
For each format, SoftFloat implements the remainder operation defined by the
IEEE Floating-Point Standard.
The remainder functions have names
<BLOCKQUOTE>
<CODE><<I>float</I>>_rem</CODE>
</BLOCKQUOTE>
Each remainder operation takes two floating-point operands of the same format
and returns a result in the same format.
Depending on whether floating-point values are passed by value or via pointers,
the remainder functions have signatures of these forms:
<BLOCKQUOTE>
<PRE>
float64_t f64_rem( float64_t <I>a</I>, float64_t <I>b</I> );
</PRE>
<PRE>
void
 f128M_rem(
     const float128_t *<I>aPtr</I>, const float128_t *<I>bPtr</I>, float128_t *<I>destPtr</I> );
</PRE>
</BLOCKQUOTE>
When floating-point values are passed indirectly through pointers, arguments
<CODE><I>aPtr</I></CODE> and <CODE><I>bPtr</I></CODE> point to operands
<CODE><I>a</I></CODE> and <CODE><I>b</I></CODE> respectively, and
<CODE><I>destPtr</I></CODE> points to the location where the result is stored.
</P>

<P>
The IEEE Standard remainder operation computes the value
<NOBR><CODE><I>a</I></CODE>
 − <I>n</I> × <CODE><I>b</I></CODE></NOBR>,
where <I>n</I> is the integer closest to
<NOBR><CODE><I>a</I></CODE> ÷ <CODE><I>b</I></CODE></NOBR>.
If <NOBR><CODE><I>a</I></CODE> ÷ <CODE><I>b</I></CODE></NOBR> is exactly
halfway between two integers, <I>n</I> is the <EM>even</EM> integer closest to
<NOBR><CODE><I>a</I></CODE> ÷ <CODE><I>b</I></CODE></NOBR>.
The IEEE Standard’s remainder operation is always exact and so requires
no rounding.
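As a worked illustration of this definition, the sketch below applies it to
ordinary integers, where every step is exact.
The function name is hypothetical, and this is not a description of how
SoftFloat computes remainders; the sketch assumes a nonzero
<CODE><I>b</I></CODE> and operands small enough that doubling them cannot
overflow.

```c
#include <stdint.h>

/* Hypothetical illustration: the remainder a - n*b for integer operands,
   where n is the integer nearest to a/b and ties go to the even n.
   Assumes b != 0 and magnitudes small enough that 2*|r| cannot overflow. */
static int64_t demo_ieeeRemainder( int64_t a, int64_t b )
{
    int64_t q = a / b;      /* quotient truncated toward zero */
    int64_t r = a % b;      /* a == q*b + r, with |r| < |b| */
    int64_t absR = (r < 0) ? -r : r;
    int64_t absB = (b < 0) ? -b : b;

    /* n is one step farther from zero than q when the discarded fraction
       of a/b exceeds one half, or equals one half while q is odd. */
    if ( (2 * absR > absB) || ((2 * absR == absB) && (q % 2 != 0)) ) {
        /* r has the sign of a, so a/b is positive exactly when r and b
           have the same sign; step n accordingly and adjust r to match. */
        if ( (r < 0) == (b < 0) ) {
            r -= b;
        } else {
            r += b;
        }
    }
    return r;
}
```

For example, with <CODE><I>a</I></CODE> = 7 and <CODE><I>b</I></CODE> = 2 the
quotient 3.5 is a tie, so <I>n</I> is the even integer 4 and the remainder is
7 − 4 × 2 = −1.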
</P>

<P>
Depending on the relative magnitudes of the operands, the remainder
functions can take considerably longer to execute than the other SoftFloat
functions.
This is inherent in the remainder operation itself and is not a flaw in the
SoftFloat implementation.
</P>

<H3>8.7. Round-to-Integer Functions</H3>

<P>
For each format, SoftFloat implements the round-to-integer operation specified
by the IEEE Floating-Point Standard.
These functions are named
<BLOCKQUOTE>
<CODE><<I>float</I>>_roundToInt</CODE>
</BLOCKQUOTE>
Each round-to-integer operation takes a single floating-point operand.
This operand is rounded to an integer according to a specified rounding mode,
and the resulting integer value is returned in the same floating-point format.
(Note that the result is not an integer type.)
</P>

<P>
The signatures of the round-to-integer functions are similar to those for
conversions to an integer type:
<BLOCKQUOTE>
<PRE>
float64_t f64_roundToInt( float64_t <I>a</I>, uint_fast8_t <I>roundingMode</I>, bool <I>exact</I> );
</PRE>
<PRE>
void
 f128M_roundToInt(
     const float128_t *<I>aPtr</I>,
     uint_fast8_t <I>roundingMode</I>,
     bool <I>exact</I>,
     float128_t *<I>destPtr</I>
 );
</PRE>
</BLOCKQUOTE>
The <CODE><I>roundingMode</I></CODE> argument specifies the rounding mode to
apply.
The variable that usually indicates rounding mode,
<CODE>softfloat_roundingMode</CODE>, is ignored.
Argument <CODE><I>exact</I></CODE> determines whether the <I>inexact</I>
exception flag is raised if the conversion is not exact.
If <CODE><I>exact</I></CODE> is <CODE>true</CODE>, the <I>inexact</I> flag may
be raised;
otherwise, it will not be, even if the conversion is inexact.
When floating-point values are passed indirectly through pointers,
<CODE><I>aPtr</I></CODE> points to the input operand and
<CODE><I>destPtr</I></CODE> points to the location where the result is stored.
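The effect of the <CODE><I>roundingMode</I></CODE> argument is easiest to see
on values that lie exactly halfway between two integers.
The sketch below is a hypothetical model of the five modes, not SoftFloat code:
it takes the value as a count of halves so that the arithmetic stays exact, and
the <CODE>demo_</CODE> names are stand-ins assumed for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in names for the five rounding modes (assumed for this sketch). */
enum demo_round {
    demo_round_near_even,
    demo_round_near_maxMag,
    demo_round_minMag,
    demo_round_min,
    demo_round_max
};

/* Round the exact value halves/2 to an integer under the given mode. */
static int64_t demo_roundHalves( int64_t halves, enum demo_round mode )
{
    /* Floor of halves/2, computed without relying on how negative
       integer division rounds. */
    int64_t floorVal = (halves >= 0) ? halves / 2 : -((-halves + 1) / 2);
    bool isTie = (halves % 2) != 0;

    if ( !isTie ) return floorVal;          /* already an integer */

    /* Here the value is exactly floorVal + 0.5. */
    switch ( mode ) {
     case demo_round_near_even:
        return (floorVal % 2 == 0) ? floorVal : floorVal + 1;
     case demo_round_near_maxMag:
        return (halves > 0) ? floorVal + 1 : floorVal;
     case demo_round_minMag:
        return (halves > 0) ? floorVal : floorVal + 1;
     case demo_round_min:
        return floorVal;
     case demo_round_max:
        return floorVal + 1;
    }
    return floorVal;                        /* unknown mode: not reached */
}
```

In this model the value 2.5 (five halves) becomes 2 under ties-to-even,
toward-zero, and downward rounding, and 3 under ties-to-maximum-magnitude and
upward rounding.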
</P>

<H3>8.8. Comparison Functions</H3>

<P>
For each format, the following floating-point comparison functions are
provided:
<BLOCKQUOTE>
<CODE><<I>float</I>>_eq</CODE><BR>
<CODE><<I>float</I>>_le</CODE><BR>
<CODE><<I>float</I>>_lt</CODE>
</BLOCKQUOTE>
Each comparison takes two operands of the same type and returns a Boolean.
The abbreviation <CODE>eq</CODE> stands for “equal” (=);
<CODE>le</CODE> stands for “less than or equal” (≤);
and <CODE>lt</CODE> stands for “less than” (<).
Depending on whether the floating-point operands are passed by value or via
pointers, the comparison functions have signatures of these forms:
<BLOCKQUOTE>
<PRE>
bool f64_eq( float64_t <I>a</I>, float64_t <I>b</I> );
</PRE>
<PRE>
bool f128M_eq( const float128_t *<I>aPtr</I>, const float128_t *<I>bPtr</I> );
</PRE>
</BLOCKQUOTE>
</P>

<P>
The usual greater-than (>), greater-than-or-equal (≥), and not-equal
(≠) comparisons are easily obtained from the functions provided.
The not-equal function is just the logical complement of the equal function.
The greater-than-or-equal function is identical to the less-than-or-equal
function with the arguments in reverse order, and likewise the greater-than
function is identical to the less-than function with the arguments reversed.
</P>

<P>
The IEEE Floating-Point Standard specifies that the less-than-or-equal and
less-than comparisons by default raise the <I>invalid</I> exception if either
operand is any kind of NaN.
Equality comparisons, on the other hand, are defined by default to raise the
<I>invalid</I> exception only for signaling NaNs, not quiet NaNs.
For completeness, SoftFloat provides these complementary functions:
<BLOCKQUOTE>
<CODE>&lt;<I>float</I>&gt;_eq_signaling</CODE><BR>
<CODE>&lt;<I>float</I>&gt;_le_quiet</CODE><BR>
<CODE>&lt;<I>float</I>&gt;_lt_quiet</CODE>
</BLOCKQUOTE>
The <CODE>signaling</CODE> equality comparisons are identical to the default
equality comparisons except that the <I>invalid</I> exception is raised for any
NaN input, not just for signaling NaNs.
Similarly, the <CODE>quiet</CODE> comparison functions are identical to their
default counterparts except that the <I>invalid</I> exception is not raised for
quiet NaNs.
</P>

<H3>8.9. Signaling NaN Test Functions</H3>

<P>
Functions for testing whether a floating-point value is a signaling NaN are
provided with these names:
<BLOCKQUOTE>
<CODE>&lt;<I>float</I>&gt;_isSignalingNaN</CODE>
</BLOCKQUOTE>
The functions take one floating-point operand and return a Boolean indicating
whether the operand is a signaling NaN.
Accordingly, the functions have the forms
<BLOCKQUOTE>
<PRE>
bool f64_isSignalingNaN( float64_t <I>a</I> );
</PRE>
<PRE>
bool f128M_isSignalingNaN( const float128_t *<I>aPtr</I> );
</PRE>
</BLOCKQUOTE>
</P>

<H3>8.10. Raise-Exception Function</H3>

<P>
SoftFloat provides a single function for raising floating-point exceptions:
<BLOCKQUOTE>
<PRE>
void softfloat_raise( uint_fast8_t <I>exceptions</I> );
</PRE>
</BLOCKQUOTE>
The <CODE><I>exceptions</I></CODE> argument is a mask indicating the set of
exceptions to raise.
(See earlier section 7, <I>Exceptions and Exception Flags</I>.)
In addition to setting the specified exception flags in variable
<CODE>softfloat_exceptionFlags</CODE>, the <CODE>softfloat_raise</CODE>
function may cause a trap or abort appropriate for the current system.
</P>


<H2>9. Changes from SoftFloat <NOBR>Release 2</NOBR></H2>

<P>
Apart from a change in the legal use license, <NOBR>Release 3</NOBR> of
SoftFloat introduced numerous technical differences compared to earlier
releases.
</P>

<H3>9.1. Name Changes</H3>

<P>
The most obvious and pervasive difference compared to <NOBR>Release 2</NOBR>
is that the names of most functions and variables have changed, even when the
behavior has not.
First, the floating-point types, the mode variables, the exception flags
variable, the function to raise exceptions, and various associated constants
have been renamed as follows:
<BLOCKQUOTE>
<TABLE>
<TR>
<TD>old name, Release 2:</TD>
<TD>new name, Release 3:</TD>
</TR>
<TR>
<TD><CODE>float32</CODE></TD>
<TD><CODE>float32_t</CODE></TD>
</TR>
<TR>
<TD><CODE>float64</CODE></TD>
<TD><CODE>float64_t</CODE></TD>
</TR>
<TR>
<TD><CODE>floatx80</CODE></TD>
<TD><CODE>extFloat80_t</CODE></TD>
</TR>
<TR>
<TD><CODE>float128</CODE></TD>
<TD><CODE>float128_t</CODE></TD>
</TR>
<TR>
<TD><CODE>float_rounding_mode</CODE></TD>
<TD><CODE>softfloat_roundingMode</CODE></TD>
</TR>
<TR>
<TD><CODE>float_round_nearest_even</CODE></TD>
<TD><CODE>softfloat_round_near_even</CODE></TD>
</TR>
<TR>
<TD><CODE>float_round_to_zero</CODE></TD>
<TD><CODE>softfloat_round_minMag</CODE></TD>
</TR>
<TR>
<TD><CODE>float_round_down</CODE></TD>
<TD><CODE>softfloat_round_min</CODE></TD>
</TR>
<TR>
<TD><CODE>float_round_up</CODE></TD>
<TD><CODE>softfloat_round_max</CODE></TD>
</TR>
<TR>
<TD><CODE>float_detect_tininess</CODE></TD>
<TD><CODE>softfloat_detectTininess</CODE></TD>
</TR>
<TR>
<TD><CODE>float_tininess_before_rounding </CODE></TD>
<TD><CODE>softfloat_tininess_beforeRounding</CODE></TD>
</TR>
<TR>
<TD><CODE>float_tininess_after_rounding</CODE></TD>
<TD><CODE>softfloat_tininess_afterRounding</CODE></TD>
</TR>
<TR>
<TD><CODE>floatx80_rounding_precision</CODE></TD>
<TD><CODE>extF80_roundingPrecision</CODE></TD>
</TR>
<TR>
<TD><CODE>float_exception_flags</CODE></TD>
<TD><CODE>softfloat_exceptionFlags</CODE></TD>
</TR>
<TR>
<TD><CODE>float_flag_inexact</CODE></TD>
<TD><CODE>softfloat_flag_inexact</CODE></TD>
</TR>
<TR>
<TD><CODE>float_flag_underflow</CODE></TD>
<TD><CODE>softfloat_flag_underflow</CODE></TD>
</TR>
<TR>
<TD><CODE>float_flag_overflow</CODE></TD>
<TD><CODE>softfloat_flag_overflow</CODE></TD>
</TR>
<TR>
<TD><CODE>float_flag_divbyzero</CODE></TD>
<TD><CODE>softfloat_flag_infinite</CODE></TD>
</TR>
<TR>
<TD><CODE>float_flag_invalid</CODE></TD>
<TD><CODE>softfloat_flag_invalid</CODE></TD>
</TR>
<TR>
<TD><CODE>float_raise</CODE></TD>
<TD><CODE>softfloat_raise</CODE></TD>
</TR>
</TABLE>
</BLOCKQUOTE>
</P>

<P>
Furthermore, <NOBR>Release 3</NOBR> adopted the following new abbreviations for
function names:
<BLOCKQUOTE>
<TABLE>
<TR>
<TD>used in names in Release 2:<CODE> </CODE></TD>
<TD>used in names in Release 3:</TD>
</TR>
<TR> <TD><CODE>int32</CODE></TD> <TD><CODE>i32</CODE></TD> </TR>
<TR> <TD><CODE>int64</CODE></TD> <TD><CODE>i64</CODE></TD> </TR>
<TR> <TD><CODE>float32</CODE></TD> <TD><CODE>f32</CODE></TD> </TR>
<TR> <TD><CODE>float64</CODE></TD> <TD><CODE>f64</CODE></TD> </TR>
<TR> <TD><CODE>floatx80</CODE></TD> <TD><CODE>extF80</CODE></TD> </TR>
<TR> <TD><CODE>float128</CODE></TD> <TD><CODE>f128</CODE></TD> </TR>
</TABLE>
</BLOCKQUOTE>
Thus, for example, the function to add two <NOBR>32-bit</NOBR> floating-point
numbers, previously called <CODE>float32_add</CODE> in <NOBR>Release 2</NOBR>,
is now <CODE>f32_add</CODE>.
Lastly, there have been a few other changes to function names:
<BLOCKQUOTE>
<TABLE>
<TR>
<TD>used in names in Release 2:<CODE> </CODE></TD>
<TD>used in names in Release 3:<CODE> </CODE></TD>
<TD>relevant functions:</TD>
</TR>
<TR>
<TD><CODE>_round_to_zero</CODE></TD>
<TD><CODE>_r_minMag</CODE></TD>
<TD>conversions from floating-point to integer (<NOBR>section 8.2</NOBR>)</TD>
</TR>
<TR>
<TD><CODE>round_to_int</CODE></TD>
<TD><CODE>roundToInt</CODE></TD>
<TD>round-to-integer functions (<NOBR>section 8.7</NOBR>)</TD>
</TR>
<TR>
<TD><CODE>is_signaling_nan </CODE></TD>
<TD><CODE>isSignalingNaN</CODE></TD>
<TD>signaling NaN test functions (<NOBR>section 8.9</NOBR>)</TD>
</TR>
</TABLE>
</BLOCKQUOTE>
</P>

<H3>9.2. Changes to Function Arguments</H3>

<P>
Besides simple name changes, some operations were given a different interface
in <NOBR>Release 3</NOBR> than they had in <NOBR>Release 2</NOBR>:
<UL>

<LI>
<P>
Since <NOBR>Release 3</NOBR>, integer arguments and results of functions have
standard types from header <CODE>&lt;stdint.h&gt;</CODE>, such as
<CODE>uint32_t</CODE>, whereas previously their types could be defined
differently for each port of SoftFloat, usually using traditional C types such
as <CODE>unsigned</CODE> <CODE>int</CODE>.
Likewise, functions in <NOBR>Release 3</NOBR> and later pass Booleans as
standard type <CODE>bool</CODE> from <CODE>&lt;stdbool.h&gt;</CODE>, whereas
previously these were again passed as a port-specific type (usually
<CODE>int</CODE>).
</P>

<LI>
<P>
As explained earlier in <NOBR>section 4.5</NOBR>, <I>Conventions for Passing
Arguments and Results</I>, SoftFloat functions in <NOBR>Release 3</NOBR> and
later may pass <NOBR>80-bit</NOBR> and <NOBR>128-bit</NOBR> floating-point
values through pointers, meaning that functions take pointer arguments and then
read or write floating-point values at the locations indicated by the pointers.
In <NOBR>Release 2</NOBR>, floating-point arguments and results were always
passed by value, regardless of their size.
</P>

<LI>
<P>
Functions that round to an integer have additional
<CODE><I>roundingMode</I></CODE> and <CODE><I>exact</I></CODE> arguments that
they did not have in <NOBR>Release 2</NOBR>.
Refer to sections 8.2 <NOBR>and 8.7</NOBR> for descriptions of these functions
since <NOBR>Release 3</NOBR>.
For <NOBR>Release 2</NOBR>, the rounding mode, when needed, was taken from the
same global variable that affects the basic arithmetic operations (now called
<CODE>softfloat_roundingMode</CODE> but previously known as
<CODE>float_rounding_mode</CODE>).
Also, for <NOBR>Release 2</NOBR>, if the original floating-point input was not
an exact integer value, and if the <I>invalid</I> exception was not raised by
the function, the <I>inexact</I> exception was always raised.
<NOBR>Release 2</NOBR> had no option to suppress raising <I>inexact</I> in this
case.
Applications using SoftFloat <NOBR>Release 3</NOBR> or later can get the same
effect as <NOBR>Release 2</NOBR> by passing variable
<CODE>softfloat_roundingMode</CODE> for argument
<CODE><I>roundingMode</I></CODE> and <CODE>true</CODE> for argument
<CODE><I>exact</I></CODE>.
</P>

</UL>
</P>

<H3>9.3. Added Capabilities</H3>

<P>
With <NOBR>Release 3</NOBR>, some new features have been added that were not
present in <NOBR>Release 2</NOBR>:
<UL>

<LI>
<P>
A port of SoftFloat can now define any of the floating-point types
<CODE>float32_t</CODE>, <CODE>float64_t</CODE>, <CODE>extFloat80_t</CODE>, and
<CODE>float128_t</CODE> as aliases for C’s standard floating-point types
<CODE>float</CODE>, <CODE>double</CODE>, and <CODE>long</CODE>
<CODE>double</CODE>, using either <CODE>#define</CODE> or <CODE>typedef</CODE>.
This potential convenience was not supported under <NOBR>Release 2</NOBR>.
</P>

<P>
(Note, however, that there may be a performance cost to defining
SoftFloat’s floating-point types this way, depending on the platform and
the applications using SoftFloat.
Ports of SoftFloat may choose to forgo the convenience in favor of better
speed.)
</P>

<LI>
<P>
Functions have been added for converting between the floating-point types and
unsigned integers.
<NOBR>Release 2</NOBR> supported only signed integers, not unsigned.
</P>

<LI>
<P>
A new, fifth rounding mode, <CODE>softfloat_round_near_maxMag</CODE> (round to
nearest, with ties to maximum magnitude, away from zero), is now supported for
all cases involving rounding.
</P>

<LI>
<P>
Fused multiply-add functions have been added for the non-extended formats,
<CODE>float32_t</CODE>, <CODE>float64_t</CODE>, and <CODE>float128_t</CODE>.
</P>

</UL>
</P>

<H3>9.4. Better Compatibility with the C Language</H3>

<P>
<NOBR>Release 3</NOBR> of SoftFloat was written to conform better to the ISO C
Standard’s rules for portability.
For example, older releases of SoftFloat employed type conversions in ways
that, while commonly practiced, are not fully defined by the C Standard.
Such problematic type conversions have generally been replaced by the use of
unions, whose behavior the C Standard now regulates more strictly.
</P>

<H3>9.5. New Organization as a Library</H3>

<P>
Starting with <NOBR>Release 3</NOBR>, SoftFloat builds as a library.
Previously, SoftFloat compiled into a single, monolithic object file containing
all the SoftFloat functions, with the consequence that a program linking with
SoftFloat would get every SoftFloat function in its binary file even if only a
few functions were actually used.
With SoftFloat in the form of a library, a program that is linked by a standard
linker will include only those functions of SoftFloat that it needs and no
others.
</P>

<H3>9.6. Optimization Gains (and Losses)</H3>

<P>
Individual SoftFloat functions have been variously improved in
<NOBR>Release 3</NOBR> compared to earlier releases.
In particular, better, faster algorithms have been deployed for the operations
of division, square root, and remainder.
For functions operating on the larger <NOBR>80-bit</NOBR> and
<NOBR>128-bit</NOBR> formats, <CODE>extFloat80_t</CODE> and
<CODE>float128_t</CODE>, code size has also generally been reduced.
</P>

<P>
However, because <NOBR>Release 2</NOBR> compiled all of SoftFloat together as a
single object file, compilers could make optimizations across function calls
when one SoftFloat function calls another.
Now that the functions of SoftFloat are compiled separately and only afterward
linked together into a program, there is not usually the same opportunity to
optimize across function calls.
Some loss of speed has been observed due to this change.
</P>


<H2>10. Future Directions</H2>

<P>
The following improvements are anticipated for future releases of SoftFloat:
<UL>
<LI>
support for the common <NOBR>16-bit</NOBR> “half-precision”
floating-point format;
<LI>
more functions from the 2008 version of the IEEE Floating-Point Standard;
<LI>
consistent, defined behavior for non-canonical representations of extended
format <CODE>extFloat80_t</CODE> (discussed in <NOBR>section 4.4</NOBR>,
<I>Non-canonical Representations in <CODE>extFloat80_t</CODE></I>).

</UL>
</P>


<H2>11. Contact Information</H2>

<P>
At the time of this writing, the most up-to-date information about SoftFloat
and the latest release can be found at the Web page
<A HREF="http://www.jhauser.us/arithmetic/SoftFloat.html"><CODE>http://www.jhauser.us/arithmetic/SoftFloat.html</CODE></A>.
</P>


</BODY>