<HTML>

<HEAD>
<TITLE>Berkeley SoftFloat Library Interface</TITLE>
</HEAD>

<BODY>

<H1>Berkeley SoftFloat Release 3a: Library Interface</H1>

<P>
John R. Hauser<BR>
2015 October 23<BR>
</P>


<H2>Contents</H2>

<BLOCKQUOTE>
<TABLE BORDER=0 CELLSPACING=0 CELLPADDING=0>
<COL WIDTH=25>
<COL WIDTH=*>
<TR><TD COLSPAN=2>1. Introduction</TD></TR>
<TR><TD COLSPAN=2>2. Limitations</TD></TR>
<TR><TD COLSPAN=2>3. Acknowledgments and License</TD></TR>
<TR><TD COLSPAN=2>4. Types and Functions</TD></TR>
<TR><TD></TD><TD>4.1. Boolean and Integer Types</TD></TR>
<TR><TD></TD><TD>4.2. Floating-Point Types</TD></TR>
<TR><TD></TD><TD>4.3. Supported Floating-Point Functions</TD></TR>
<TR>
  <TD></TD>
  <TD>4.4. Non-canonical Representations in <CODE>extFloat80_t</CODE></TD>
</TR>
<TR><TD></TD><TD>4.5. Conventions for Passing Arguments and Results</TD></TR>
<TR><TD COLSPAN=2>5. Reserved Names</TD></TR>
<TR><TD COLSPAN=2>6. Mode Variables</TD></TR>
<TR><TD></TD><TD>6.1. Rounding Mode</TD></TR>
<TR><TD></TD><TD>6.2. Underflow Detection</TD></TR>
<TR>
  <TD></TD>
  <TD>6.3. Rounding Precision for the <NOBR>80-Bit</NOBR> Extended Format</TD>
</TR>
<TR><TD COLSPAN=2>7. Exceptions and Exception Flags</TD></TR>
<TR><TD COLSPAN=2>8. Function Details</TD></TR>
<TR><TD></TD><TD>8.1. Conversions from Integer to Floating-Point</TD></TR>
<TR><TD></TD><TD>8.2. Conversions from Floating-Point to Integer</TD></TR>
<TR><TD></TD><TD>8.3. Conversions Among Floating-Point Types</TD></TR>
<TR><TD></TD><TD>8.4. Basic Arithmetic Functions</TD></TR>
<TR><TD></TD><TD>8.5. Fused Multiply-Add Functions</TD></TR>
<TR><TD></TD><TD>8.6. Remainder Functions</TD></TR>
<TR><TD></TD><TD>8.7. Round-to-Integer Functions</TD></TR>
<TR><TD></TD><TD>8.8. Comparison Functions</TD></TR>
<TR><TD></TD><TD>8.9. Signaling NaN Test Functions</TD></TR>
<TR><TD></TD><TD>8.10. Raise-Exception Function</TD></TR>
<TR><TD COLSPAN=2>9. Changes from SoftFloat <NOBR>Release 2</NOBR></TD></TR>
<TR><TD></TD><TD>9.1. Name Changes</TD></TR>
<TR><TD></TD><TD>9.2. Changes to Function Arguments</TD></TR>
<TR><TD></TD><TD>9.3. Added Capabilities</TD></TR>
<TR><TD></TD><TD>9.4. Better Compatibility with the C Language</TD></TR>
<TR><TD></TD><TD>9.5. New Organization as a Library</TD></TR>
<TR><TD></TD><TD>9.6. Optimization Gains (and Losses)</TD></TR>
<TR><TD COLSPAN=2>10. Future Directions</TD></TR>
<TR><TD COLSPAN=2>11. Contact Information</TD></TR>
</TABLE>
</BLOCKQUOTE>


<H2>1. Introduction</H2>

<P>
Berkeley SoftFloat is a software implementation of binary floating-point that
conforms to the IEEE Standard for Floating-Point Arithmetic.
The current release supports four binary formats:  <NOBR>32-bit</NOBR>
single-precision, <NOBR>64-bit</NOBR> double-precision, <NOBR>80-bit</NOBR>
double-extended-precision, and <NOBR>128-bit</NOBR> quadruple-precision.
The following functions are supported for each format:
<UL>
<LI>
addition, subtraction, multiplication, division, and square root;
<LI>
fused multiply-add as defined by the IEEE Standard, except for
<NOBR>80-bit</NOBR> double-extended-precision;
<LI>
remainder as defined by the IEEE Standard;
<LI>
round to integral value;
<LI>
comparisons;
<LI>
conversions to/from other supported formats; and
<LI>
conversions to/from <NOBR>32-bit</NOBR> and <NOBR>64-bit</NOBR> integers,
signed and unsigned.
</UL>
All operations required by the original 1985 version of the IEEE Floating-Point
Standard are implemented, except for conversions to and from decimal.
</P>

<P>
This document gives information about the types defined and the routines
implemented by SoftFloat.
It does not attempt to define or explain the IEEE Floating-Point Standard.
Information about the standard is available elsewhere.
</P>

<P>
The current version of SoftFloat is <NOBR>Release 3a</NOBR>.
The only difference between this version and the previous
<NOBR>Release 3</NOBR> is the replacement of the license text supplied by the
University of California.
</P>

<P>
The functional interface of SoftFloat <NOBR>Release 3</NOBR> and afterward
differs in many details from that of earlier releases.
For specifics of these differences, see <NOBR>section 9</NOBR> below,
<I>Changes from SoftFloat <NOBR>Release 2</NOBR></I>.
</P>


<H2>2. Limitations</H2>

<P>
SoftFloat assumes the computer has an addressable byte size of 8 or
<NOBR>16 bits</NOBR>.
(Nearly all computers in use today have <NOBR>8-bit</NOBR> bytes.)
</P>

<P>
SoftFloat is written in C and is designed to work with other C code.
The C compiler used must conform at a minimum to the 1989 ANSI standard for the
C language (same as the 1990 ISO standard) and must in addition support basic
arithmetic on <NOBR>64-bit</NOBR> integers.
Earlier releases of SoftFloat included implementations of <NOBR>32-bit</NOBR>
single-precision and <NOBR>64-bit</NOBR> double-precision floating-point that
did not require <NOBR>64-bit</NOBR> integers, but this option is not supported
starting with <NOBR>Release 3</NOBR>.
Since 1999, ISO standards for C have mandated compiler support for
<NOBR>64-bit</NOBR> integers.
A compiler conforming to the 1999 C Standard or later is recommended but not
strictly required.
</P>

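<P>
The host assumptions above (byte size of 8 or <NOBR>16 bits</NOBR>, and usable
<NOBR>64-bit</NOBR> integers) can be checked from application code.
The following is only a minimal sketch: the names
<CODE>demo_host_meets_requirements</CODE> and <CODE>demo_require_host</CODE>
are hypothetical and are not part of SoftFloat, and the sketch assumes a
compiler that supplies <CODE>&lt;stdint.h&gt;</CODE>.
</P>

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Hypothetical helper (not part of SoftFloat): returns 1 when the host
   matches the byte-size and 64-bit-integer assumptions, otherwise 0. */
int demo_host_meets_requirements(void)
{
    /* The addressable byte size must be 8 or 16 bits. */
    int byte_ok = (CHAR_BIT == 8) || (CHAR_BIT == 16);

    /* uint64_t must behave as a 64-bit type: bit 63 of all-ones is set
       and shifting right by 63 leaves exactly 1. */
    uint64_t all_ones = ~(uint64_t)0;
    int u64_ok = (all_ones == UINT64_MAX) && ((all_ones >> 63) == 1);

    return (byte_ok && u64_ok) ? 1 : 0;
}

/* Hypothetical helper: stop early (in debug builds) if the host does not
   meet the assumptions. */
void demo_require_host(void)
{
    assert(demo_host_meets_requirements() == 1);
}
```

<P>
An application might call <CODE>demo_require_host()</CODE> once at startup,
before any floating-point work is done.
</P>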
<P>
Most operations not required by the original 1985 version of the IEEE
Floating-Point Standard but added in the 2008 version are not yet supported in
SoftFloat <NOBR>Release 3a</NOBR>.
</P>


<H2>3. Acknowledgments and License</H2>

<P>
The SoftFloat package was written by me, <NOBR>John R.</NOBR> Hauser.
<NOBR>Release 3</NOBR> of SoftFloat was a completely new implementation
supplanting earlier releases.
The project to create <NOBR>Release 3</NOBR> (and <NOBR>now 3a</NOBR>) was done
in the employ of the University of California, Berkeley, within the Department
of Electrical Engineering and Computer Sciences, first for the Parallel
Computing Laboratory (Par Lab) and then for the ASPIRE Lab.
The work was officially overseen by Prof. Krste Asanovic, with funding provided
by these sources:
<BLOCKQUOTE>
<TABLE>
<COL>
<COL WIDTH=10>
<COL>
<TR>
<TD VALIGN=TOP><NOBR>Par Lab:</NOBR></TD>
<TD></TD>
<TD>
Microsoft (Award #024263), Intel (Award #024894), and U.C. Discovery
(Award #DIG07-10227), with additional support from Par Lab affiliates Nokia,
NVIDIA, Oracle, and Samsung.
</TD>
</TR>
<TR>
<TD VALIGN=TOP><NOBR>ASPIRE Lab:</NOBR></TD>
<TD></TD>
<TD>
DARPA PERFECT program (Award #HR0011-12-2-0016), with additional support from
ASPIRE industrial sponsor Intel and ASPIRE affiliates Google, Nokia, NVIDIA,
Oracle, and Samsung.
</TD>
</TR>
</TABLE>
</BLOCKQUOTE>
</P>

<P>
The following applies to the whole of SoftFloat <NOBR>Release 3a</NOBR> as well
as to each source file individually.
</P>

<P>
Copyright 2011, 2012, 2013, 2014, 2015 The Regents of the University of
California.
All rights reserved.
</P>

<P>
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
<OL>

<LI>
<P>
Redistributions of source code must retain the above copyright notice, this
list of conditions, and the following disclaimer.
</P>

<LI>
<P>
Redistributions in binary form must reproduce the above copyright notice, this
list of conditions, and the following disclaimer in the documentation and/or
other materials provided with the distribution.
</P>

<LI>
<P>
Neither the name of the University nor the names of its contributors may be
used to endorse or promote products derived from this software without specific
prior written permission.
</P>

</OL>
</P>

<P>
THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS &ldquo;AS IS&rdquo;,
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE, ARE
DISCLAIMED.
IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE
OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
</P>


<H2>4. Types and Functions</H2>

<P>
The types and functions of SoftFloat are declared in header file
<CODE>softfloat.h</CODE>.
</P>

<H3>4.1. Boolean and Integer Types</H3>

<P>
Header file <CODE>softfloat.h</CODE> depends on standard headers
<CODE>&lt;stdbool.h&gt;</CODE> and <CODE>&lt;stdint.h&gt;</CODE> to define type
<CODE>bool</CODE> and several integer types.
These standard headers have been part of the ISO C Standard Library since 1999.
With any recent compiler, they are likely to be supported, even if the compiler
does not claim complete conformance to the ISO C Standard.
For older or nonstandard compilers, a port of SoftFloat may have substitutes
for these headers.
Header <CODE>softfloat.h</CODE> depends only on the name <CODE>bool</CODE> from
<CODE>&lt;stdbool.h&gt;</CODE> and on these type names from
<CODE>&lt;stdint.h&gt;</CODE>:
<BLOCKQUOTE>
<PRE>
uint16_t
uint32_t
uint64_t
int32_t
int64_t
uint_fast8_t
uint_fast32_t
uint_fast64_t
</PRE>
</BLOCKQUOTE>
</P>


<H3>4.2. Floating-Point Types</H3>

<P>
The <CODE>softfloat.h</CODE> header defines four floating-point types:
<BLOCKQUOTE>
<TABLE CELLSPACING=0 CELLPADDING=0>
<TR>
<TD><CODE>float32_t</CODE></TD>
<TD><NOBR>32-bit</NOBR> single-precision binary format</TD>
</TR>
<TR>
<TD><CODE>float64_t</CODE></TD>
<TD><NOBR>64-bit</NOBR> double-precision binary format</TD>
</TR>
<TR>
<TD><CODE>extFloat80_t&nbsp;&nbsp;&nbsp;</CODE></TD>
<TD><NOBR>80-bit</NOBR> double-extended-precision binary format (old Intel or
Motorola format)</TD>
</TR>
<TR>
<TD><CODE>float128_t</CODE></TD>
<TD><NOBR>128-bit</NOBR> quadruple-precision binary format</TD>
</TR>
</TABLE>
</BLOCKQUOTE>
The non-extended types are each exactly the size specified:
<NOBR>32 bits</NOBR> for <CODE>float32_t</CODE>, <NOBR>64 bits</NOBR> for
<CODE>float64_t</CODE>, and <NOBR>128 bits</NOBR> for <CODE>float128_t</CODE>.
Aside from these size requirements, the definitions of all these types may
differ for different ports of SoftFloat to specific systems.
A given port of SoftFloat may or may not define some of the floating-point
types as aliases for the C standard types <CODE>float</CODE>,
<CODE>double</CODE>, and <CODE>long</CODE> <CODE>double</CODE>.
</P>

<P>
Header file <CODE>softfloat.h</CODE> also defines a structure,
<CODE>struct</CODE> <CODE>extFloat80M</CODE>, for the representation of
<NOBR>80-bit</NOBR> double-extended-precision floating-point values in memory.
This structure is the same size as type <CODE>extFloat80_t</CODE> and contains
at least these two fields (not necessarily in this order):
<BLOCKQUOTE>
<PRE>
uint16_t signExp;
uint64_t signif;
</PRE>
</BLOCKQUOTE>
Field <CODE>signExp</CODE> contains the sign and exponent of the floating-point
value, with the sign in the most significant bit (<NOBR>bit 15</NOBR>) and the
encoded exponent in the other <NOBR>15 bits</NOBR>.
Field <CODE>signif</CODE> is the complete <NOBR>64-bit</NOBR> significand of
the floating-point value.
(In the usual encoding for <NOBR>80-bit</NOBR> extended floating-point, the
leading <NOBR>1 bit</NOBR> of normalized numbers is not implicit but is stored
in the most significant bit of the significand.)
</P>

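<P>
As an illustration of the field layout just described, the sketch below
extracts the sign, the encoded exponent, and the explicit leading significand
bit.
It is not SoftFloat code: <CODE>struct demo_ext80</CODE> is a hypothetical
stand-in for <CODE>struct extFloat80M</CODE> (whose real definition and field
order come from <CODE>softfloat.h</CODE> and may differ between ports), and the
three accessor functions are likewise invented for this example.
</P>

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for struct extFloat80M; only the two documented
   fields are modeled, and their order here is arbitrary. */
struct demo_ext80 {
    uint16_t signExp;
    uint64_t signif;
};

/* The sign is bit 15 of signExp: 0 for positive, 1 for negative. */
unsigned demo_ext80_sign(const struct demo_ext80 *x)
{
    assert(x != 0);
    return (unsigned)(x->signExp >> 15);
}

/* The encoded (biased) exponent is the low 15 bits of signExp. */
unsigned demo_ext80_exp(const struct demo_ext80 *x)
{
    assert(x != 0);
    return (unsigned)(x->signExp & 0x7FFF);
}

/* The explicit leading significand bit is bit 63 of signif. */
unsigned demo_ext80_leading_bit(const struct demo_ext80 *x)
{
    assert(x != 0);
    return (unsigned)(x->signif >> 63);
}
```

<P>
For instance, the value &minus;2.0 has <CODE>signExp</CODE> equal to
<CODE>0xC000</CODE> and <CODE>signif</CODE> equal to
<CODE>0x8000000000000000</CODE>, so the accessors report a sign of 1, an
encoded exponent of <CODE>0x4000</CODE>, and a leading bit of 1.
</P>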
<H3>4.3. Supported Floating-Point Functions</H3>

<P>
SoftFloat implements these arithmetic operations for its floating-point types:
<UL>
<LI>
conversions between any two floating-point formats;
<LI>
for each floating-point format, conversions to and from signed and unsigned
<NOBR>32-bit</NOBR> and <NOBR>64-bit</NOBR> integers;
<LI>
for each format, the usual addition, subtraction, multiplication, division, and
square root operations;
<LI>
for each format except <CODE>extFloat80_t</CODE>, the fused multiply-add
operation defined by the IEEE Standard;
<LI>
for each format, the floating-point remainder operation defined by the IEEE
Standard;
<LI>
for each format, a &ldquo;round to integer&rdquo; operation that rounds to the
nearest integer value in the same format; and
<LI>
comparisons between two values in the same floating-point format.
</UL>
</P>

<P>
The following operations required by the 2008 IEEE Floating-Point Standard are
not supported in SoftFloat <NOBR>Release 3a</NOBR>:
<UL>
<LI>
<B>nextUp</B>, <B>nextDown</B>, <B>minNum</B>, <B>maxNum</B>, <B>minNumMag</B>,
<B>maxNumMag</B>, <B>scaleB</B>, and <B>logB</B>;
<LI>
conversions between floating-point formats and decimal or hexadecimal character
sequences;
<LI>
all &ldquo;quiet-computation&rdquo; operations (<B>copy</B>, <B>negate</B>,
<B>abs</B>, and <B>copySign</B>, which all involve only simple copying and/or
manipulation of the floating-point sign bit); and
<LI>
all &ldquo;non-computational&rdquo; operations other than <B>isSignaling</B>
(which is supported).
</UL>
</P>

<H3>4.4. Non-canonical Representations in <CODE>extFloat80_t</CODE></H3>

<P>
Because the <NOBR>80-bit</NOBR> double-extended-precision format,
<CODE>extFloat80_t</CODE>, stores an explicit leading significand bit, many
floating-point numbers are encodable in this type in equivalent normalized and
denormalized forms.
Zeros and values in the subnormal range have each only a single possible
encoding, for which the leading significand bit must <NOBR>be 0</NOBR>.
For other finite values (outside the subnormal range), a unique normalized
representation, with leading significand bit set <NOBR>to 1</NOBR>, always
exists, and is considered the <I>canonical</I> representation of the value.
Any equivalent denormalized representations (having leading significand bit
<NOBR>of 0</NOBR>) are <I>non-canonical</I>.
Similarly, the leading significand bit is expected to <NOBR>be 1</NOBR> for
infinities and NaNs as well;
any infinity or NaN with a leading significand bit <NOBR>of 0</NOBR> is again
considered non-canonical.
In short, for an <CODE>extFloat80_t</CODE> representation to be canonical, the
leading significand bit must <NOBR>be 1</NOBR> unless it is required to
<NOBR>be 0</NOBR> because the encoded value is zero or a subnormal.
</P>

<P>
Functions are not guaranteed to operate as expected when inputs of type
<CODE>extFloat80_t</CODE> are non-canonical.
Assuming all of a function&rsquo;s <CODE>extFloat80_t</CODE> inputs (if any)
are canonical, function outputs of type <CODE>extFloat80_t</CODE> will always
be canonical.
</P>

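<P>
The canonical-form rule above can be sketched as a small predicate over the two
documented fields.
This is an illustration under stated assumptions, not a SoftFloat function:
<CODE>struct demo_x80</CODE> and <CODE>demo_x80_is_canonical</CODE> are
hypothetical names.
One case is not spelled out in the text, namely an encoded exponent of 0
combined with a leading significand bit of 1; the sketch assumes such an
encoding should be reported as non-canonical, and that choice is a guess.
</P>

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct extFloat80M (real layout is set by
   softfloat.h and may differ between ports). */
struct demo_x80 {
    uint16_t signExp;
    uint64_t signif;
};

/* Returns true when the encoding follows the canonical-form rule:
   leading significand bit 1, except 0 for zeros and subnormals. */
bool demo_x80_is_canonical(const struct demo_x80 *x)
{
    unsigned exp;
    bool leading;

    assert(x != 0);
    exp = (unsigned)(x->signExp & 0x7FFF);   /* encoded exponent */
    leading = (x->signif >> 63) != 0;        /* explicit leading bit */

    if (exp == 0) {
        /* Encoded exponent 0 is the zero/subnormal encoding, where the
           leading bit must be 0.  ASSUMPTION: exponent 0 with leading
           bit 1 is treated here as non-canonical; the text does not
           address that case explicitly. */
        return !leading;
    }

    /* Normal numbers, infinities, and NaNs need a leading bit of 1. */
    return leading;
}
```

<P>
A caller that receives <NOBR>80-bit</NOBR> values from an external source could
apply such a check before handing them to functions that expect canonical
inputs.
</P>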
<H3>4.5. Conventions for Passing Arguments and Results</H3>

<P>
Values that are at most <NOBR>64 bits</NOBR> in size (i.e., not the
<NOBR>80-bit</NOBR> or <NOBR>128-bit</NOBR> floating-point formats) are in all
cases passed as function arguments by value.
Likewise, when an output of a function is no more than <NOBR>64 bits</NOBR>, it
is always returned directly as the function result.
Thus, for example, the SoftFloat function for adding two <NOBR>64-bit</NOBR>
floating-point values has this simple signature:
<BLOCKQUOTE>
<CODE>float64_t f64_add( float64_t, float64_t );</CODE>
</BLOCKQUOTE>
</P>

<P>
The story is more complex when function inputs and outputs are
<NOBR>80-bit</NOBR> and <NOBR>128-bit</NOBR> floating-point.
For these types, SoftFloat always provides a function that passes these larger
values into or out of the function indirectly, via pointers.
For example, for adding two <NOBR>128-bit</NOBR> floating-point values,
SoftFloat supplies this function:
<BLOCKQUOTE>
<CODE>void f128M_add( const float128_t *, const float128_t *, float128_t * );</CODE>
</BLOCKQUOTE>
The first two arguments point to the values to be added, and the last argument
points to the location where the sum will be stored.
The <CODE>M</CODE> in the name <CODE>f128M_add</CODE> is mnemonic for the fact
that the <NOBR>128-bit</NOBR> inputs and outputs are &ldquo;in memory&rdquo;,
pointed to by pointer arguments.
</P>

<P>
All ports of SoftFloat implement these <I>pass-by-pointer</I> functions for
types <CODE>extFloat80_t</CODE> and <CODE>float128_t</CODE>.
At the same time, SoftFloat ports may also implement alternate versions of
these same functions that pass <CODE>extFloat80_t</CODE> and
<CODE>float128_t</CODE> by value, like the smaller formats.
Thus, besides the function with name <CODE>f128M_add</CODE> shown above, a
SoftFloat port may also supply an equivalent function with this signature:
<BLOCKQUOTE>
<CODE>float128_t f128_add( float128_t, float128_t );</CODE>
</BLOCKQUOTE>
</P>

<P>
As a general rule, on computers where the machine word size is
<NOBR>32 bits</NOBR> or smaller, only the pass-by-pointer versions of functions
(e.g., <CODE>f128M_add</CODE>) are provided for types <CODE>extFloat80_t</CODE>
and <CODE>float128_t</CODE>, because passing such large types directly can have
significant extra cost.
On computers where the word size is <NOBR>64 bits</NOBR> or larger, both
function versions (<CODE>f128M_add</CODE> and <CODE>f128_add</CODE>) are
provided, because the cost of passing by value is then more reasonable.
Applications that must be portable across both classes of computers must use
the pointer-based functions, as these are always implemented.
However, if it is known that SoftFloat includes the by-value functions for all
platforms of interest, programmers can use whichever version they prefer.
</P>


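<P>
To make the two calling conventions concrete without depending on a particular
SoftFloat port, the sketch below applies the same naming pattern to a plain
<NOBR>128-bit</NOBR> unsigned integer.
Everything in it is hypothetical: <CODE>demo_u128</CODE>,
<CODE>demo_u128M_add</CODE>, and <CODE>demo_u128_add</CODE> are invented for
illustration and perform integer addition, not floating-point addition.
Only the shape of the interfaces is meant to mirror <CODE>f128M_add</CODE> and
<CODE>f128_add</CODE>.
</P>

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 128-bit container, standing in for a large value type. */
typedef struct {
    uint64_t lo;   /* low 64 bits */
    uint64_t hi;   /* high 64 bits */
} demo_u128;

/* Pass-by-pointer form ("M" for "in memory"): operands and result are
   reached through pointers.  The result is computed into temporaries
   first so that out may alias a or b. */
void demo_u128M_add(const demo_u128 *a, const demo_u128 *b, demo_u128 *out)
{
    uint64_t lo, hi;

    assert(a != 0 && b != 0 && out != 0);
    lo = a->lo + b->lo;
    hi = a->hi + b->hi + (lo < a->lo);   /* add carry out of the low half */
    out->lo = lo;
    out->hi = hi;
}

/* Pass-by-value form: a thin wrapper over the pointer-based function. */
demo_u128 demo_u128_add(demo_u128 a, demo_u128 b)
{
    demo_u128 sum;

    demo_u128M_add(&a, &b, &sum);
    return sum;
}
```

<P>
Code that calls only the pointer-based form compiles unchanged whether or not
the by-value form exists, which is the portability argument made above.
</P>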
<H2>5. Reserved Names</H2>

<P>
In addition to the variables and functions documented here, SoftFloat defines
some symbol names for its own private use.
These private names always begin with the prefix
&lsquo;<CODE>softfloat_</CODE>&rsquo;.
When a program includes header <CODE>softfloat.h</CODE> or links with the
SoftFloat library, all names with prefix &lsquo;<CODE>softfloat_</CODE>&rsquo;
are reserved for possible use by SoftFloat.
Applications that use SoftFloat should not define their own names with this
prefix, and should reference only such names as are documented.
</P>


<H2>6. Mode Variables</H2>

<P>
The following variables control rounding mode, underflow detection, and the
<NOBR>80-bit</NOBR> extended format&rsquo;s rounding precision:
<BLOCKQUOTE>
<CODE>softfloat_roundingMode</CODE><BR>
<CODE>softfloat_detectTininess</CODE><BR>
<CODE>extF80_roundingPrecision</CODE>
</BLOCKQUOTE>
These mode variables are covered in the next several subsections.
</P>

505*9403c583SJens Wiklander<H3>6.1. Rounding Mode</H3>
506*9403c583SJens Wiklander
507*9403c583SJens Wiklander<P>
508*9403c583SJens WiklanderAll five rounding modes defined by the 2008 IEEE Floating-Point Standard are
509*9403c583SJens Wiklanderimplemented for all operations that require rounding.
510*9403c583SJens WiklanderThe rounding mode is selected by the global variable
511*9403c583SJens Wiklander<BLOCKQUOTE>
512*9403c583SJens Wiklander<CODE>uint_fast8_t softfloat_roundingMode;</CODE>
513*9403c583SJens Wiklander</BLOCKQUOTE>
514*9403c583SJens WiklanderThis variable may be set to one of the values
515*9403c583SJens Wiklander<BLOCKQUOTE>
516*9403c583SJens Wiklander<TABLE CELLSPACING=0 CELLPADDING=0>
517*9403c583SJens Wiklander<TR>
518*9403c583SJens Wiklander<TD><CODE>softfloat_round_near_even</CODE></TD>
519*9403c583SJens Wiklander<TD>round to nearest, with ties to even</TD>
520*9403c583SJens Wiklander</TR>
521*9403c583SJens Wiklander<TR>
522*9403c583SJens Wiklander<TD><CODE>softfloat_round_near_maxMag&nbsp;&nbsp;</CODE></TD>
523*9403c583SJens Wiklander<TD>round to nearest, with ties to maximum magnitude (away from zero)</TD>
524*9403c583SJens Wiklander</TR>
525*9403c583SJens Wiklander<TR>
526*9403c583SJens Wiklander<TD><CODE>softfloat_round_minMag</CODE></TD>
527*9403c583SJens Wiklander<TD>round to minimum magnitude (toward zero)</TD>
528*9403c583SJens Wiklander</TR>
529*9403c583SJens Wiklander<TR>
530*9403c583SJens Wiklander<TD><CODE>softfloat_round_min</CODE></TD>
531*9403c583SJens Wiklander<TD>round to minimum (down)</TD>
532*9403c583SJens Wiklander</TR>
533*9403c583SJens Wiklander<TR>
534*9403c583SJens Wiklander<TD><CODE>softfloat_round_max</CODE></TD>
535*9403c583SJens Wiklander<TD>round to maximum (up)</TD>
536*9403c583SJens Wiklander</TR>
537*9403c583SJens Wiklander</TABLE>
538*9403c583SJens Wiklander</BLOCKQUOTE>
539*9403c583SJens WiklanderVariable <CODE>softfloat_roundingMode</CODE> is initialized to
540*9403c583SJens Wiklander<CODE>softfloat_round_near_even</CODE>.
541*9403c583SJens Wiklander</P>
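<P>
As a rough illustration of how the five modes differ, the following sketch
rounds a signed fixed-point number with 4 fraction bits to an integer.
It is not SoftFloat code: the <CODE>demo_</CODE> names are hypothetical
stand-ins chosen to mirror the constants above, and only integer arithmetic is
used.
</P>

```c
#include <stdint.h>

/* Hypothetical stand-ins mirroring the softfloat_round_* constants. */
enum demo_round {
    demo_round_near_even,
    demo_round_near_maxMag,
    demo_round_minMag,
    demo_round_min,
    demo_round_max
};

/* Round x, a fixed-point value in units of 1/16, to an integer. */
int32_t demo_round_fixed( int32_t x, enum demo_round mode )
{
    int neg = x < 0;
    uint32_t mag = neg ? (uint32_t) 0 - (uint32_t) x : (uint32_t) x;
    uint32_t q = mag >> 4;      /* integer part of the magnitude */
    uint32_t r = mag & 0xF;     /* discarded fraction, in sixteenths */
    int up = 0;                 /* nonzero: increase the magnitude by 1 */

    switch ( mode ) {
    case demo_round_near_even:   up = r > 8 || (r == 8 && (q & 1)); break;
    case demo_round_near_maxMag: up = r >= 8;                       break;
    case demo_round_minMag:      up = 0;                            break;
    case demo_round_min:         up = neg && r != 0;                break;
    case demo_round_max:         up = !neg && r != 0;               break;
    }
    q += (uint32_t) up;
    return neg ? -(int32_t) q : (int32_t) q;
}
```

<P>
For the input 40 (that is, 2.5), the five modes in the order listed give 2, 3,
2, 2, and 3; for &minus;40 (&minus;2.5) they give &minus;2, &minus;3,
&minus;2, &minus;3, and &minus;2.
</P>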
542*9403c583SJens Wiklander
543*9403c583SJens Wiklander<H3>6.2. Underflow Detection</H3>
544*9403c583SJens Wiklander
545*9403c583SJens Wiklander<P>
546*9403c583SJens WiklanderIn the terminology of the IEEE Standard, SoftFloat can detect tininess for
547*9403c583SJens Wiklanderunderflow either before or after rounding.
548*9403c583SJens WiklanderThe choice is made by the global variable
549*9403c583SJens Wiklander<BLOCKQUOTE>
550*9403c583SJens Wiklander<CODE>uint_fast8_t softfloat_detectTininess;</CODE>
551*9403c583SJens Wiklander</BLOCKQUOTE>
552*9403c583SJens Wiklanderwhich can be set to either
553*9403c583SJens Wiklander<BLOCKQUOTE>
554*9403c583SJens Wiklander<CODE>softfloat_tininess_beforeRounding</CODE><BR>
555*9403c583SJens Wiklander<CODE>softfloat_tininess_afterRounding</CODE>
556*9403c583SJens Wiklander</BLOCKQUOTE>
557*9403c583SJens WiklanderDetecting tininess after rounding is better because it results in fewer
558*9403c583SJens Wiklanderspurious underflow signals.
559*9403c583SJens WiklanderThe other option is provided for compatibility with some systems.
560*9403c583SJens WiklanderLike most systems (and as required by the newer 2008 IEEE Standard), SoftFloat
561*9403c583SJens Wiklanderalways detects loss of accuracy for underflow as an inexact result.
562*9403c583SJens Wiklander</P>
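<P>
The difference between the two settings can be sketched with a toy binary
format.
The model below is not SoftFloat code and rests on stated assumptions: a
hypothetical format with 4 bits of precision, magnitudes expressed as integers
in units of 1/16 of the subnormal spacing (so the smallest normal value is
128), and round to nearest even.
</P>

```c
#include <stdbool.h>
#include <stdint.h>

/* Smallest normal magnitude of the toy format, in units of 1/16 of the
   subnormal spacing (hypothetical value for this sketch). */
#define DEMO_MIN_NORMAL 128u

/* Tiny before rounding: the exact nonzero result is below the smallest
   normal magnitude. */
bool demo_tiny_before_rounding( uint32_t m )
{
    return m != 0 && m < DEMO_MIN_NORMAL;
}

/* Tiny after rounding: the result is still below the smallest normal
   magnitude after rounding to 4 bits as if the exponent range were
   unbounded. */
bool demo_tiny_after_rounding( uint32_t m )
{
    uint32_t q, r;

    if ( m == 0 || m >= DEMO_MIN_NORMAL ) return false;
    if ( m < DEMO_MIN_NORMAL / 2 ) return true;  /* cannot round up to 128 */
    /* In [64,128), a 4-bit significand has a spacing of 8 units. */
    q = m >> 3;
    r = m & 7;
    if ( r > 4 || (r == 4 && (q & 1)) ) ++q;     /* nearest, ties to even */
    return (q << 3) < DEMO_MIN_NORMAL;
}
```

<P>
In this model the magnitude 125 is tiny before rounding but not after, because
it rounds up to the smallest normal value 128.
With detection before rounding, an underflow would be signaled for it;
with detection after rounding, it would not.
</P>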
563*9403c583SJens Wiklander
564*9403c583SJens Wiklander<H3>6.3. Rounding Precision for the <NOBR>80-Bit</NOBR> Extended Format</H3>
565*9403c583SJens Wiklander
566*9403c583SJens Wiklander<P>
567*9403c583SJens WiklanderFor <CODE>extFloat80_t</CODE> only, the rounding precision of the basic
568*9403c583SJens Wiklanderarithmetic operations is controlled by the global variable
569*9403c583SJens Wiklander<BLOCKQUOTE>
570*9403c583SJens Wiklander<CODE>uint_fast8_t extF80_roundingPrecision;</CODE>
571*9403c583SJens Wiklander</BLOCKQUOTE>
572*9403c583SJens WiklanderThe operations affected are:
573*9403c583SJens Wiklander<BLOCKQUOTE>
574*9403c583SJens Wiklander<CODE>extF80_add</CODE><BR>
575*9403c583SJens Wiklander<CODE>extF80_sub</CODE><BR>
576*9403c583SJens Wiklander<CODE>extF80_mul</CODE><BR>
577*9403c583SJens Wiklander<CODE>extF80_div</CODE><BR>
578*9403c583SJens Wiklander<CODE>extF80_sqrt</CODE>
579*9403c583SJens Wiklander</BLOCKQUOTE>
580*9403c583SJens WiklanderWhen <CODE>extF80_roundingPrecision</CODE> is set to its default value of 80,
581*9403c583SJens Wiklanderthese operations are rounded to the full precision of the <NOBR>80-bit</NOBR>
double-extended-precision format, as occurs for other formats.
583*9403c583SJens WiklanderSetting <CODE>extF80_roundingPrecision</CODE> to 32 or to 64 causes the
584*9403c583SJens Wiklanderoperations listed to be rounded to <NOBR>32-bit</NOBR> precision (equivalent to
585*9403c583SJens Wiklander<CODE>float32_t</CODE>) or to <NOBR>64-bit</NOBR> precision (equivalent to
586*9403c583SJens Wiklander<CODE>float64_t</CODE>), respectively.
587*9403c583SJens WiklanderWhen rounding to reduced precision, additional bits in the result significand
588*9403c583SJens Wiklanderbeyond the rounding point are set to zero.
The consequences of setting <CODE>extF80_roundingPrecision</CODE> to a value
other than 32, 64, or 80 are not specified.
591*9403c583SJens WiklanderOperations other than the ones listed above are not affected by
592*9403c583SJens Wiklander<CODE>extF80_roundingPrecision</CODE>.
593*9403c583SJens Wiklander</P>
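<P>
The effect of reduced rounding precision on the significand can be sketched as
follows.
This is an illustrative model, not SoftFloat&rsquo;s implementation: the names
are hypothetical, the input is assumed to be a normalized <NOBR>64-bit</NOBR>
significand, rounding is to nearest even only, and exponent range limits are
ignored.
The widths 24 and 53 are the significand precisions of <CODE>float32_t</CODE>
and <CODE>float64_t</CODE>.
</P>

```c
#include <stdint.h>

/* Result of the sketch: the rounded significand, plus 1 in expAdjust when
   rounding carried out of the top bit and the exponent must be incremented. */
typedef struct { uint64_t sig; int expAdjust; } demo_rounded;

demo_rounded demo_round_sig64( uint64_t sig, unsigned precision )
{
    unsigned keep = (precision == 32) ? 24 : (precision == 64) ? 53 : 64;
    demo_rounded res = { sig, 0 };
    unsigned drop;
    uint64_t unit, rem, kept;

    if ( keep == 64 ) return res;       /* full precision: nothing to drop */
    drop = 64 - keep;
    unit = (uint64_t) 1 << drop;        /* weight of the last kept bit */
    rem  = sig & (unit - 1);            /* bits beyond the rounding point */
    kept = sig - rem;                   /* low bits set to zero */
    if ( rem > unit / 2 || (rem == unit / 2 && (kept & unit)) ) {
        kept += unit;                   /* round up; ties go to even */
        if ( kept == 0 ) {              /* carry out of bit 63 */
            kept = (uint64_t) 1 << 63;
            res.expAdjust = 1;
        }
    }
    res.sig = kept;
    return res;
}
```

<P>
In every case the bits below the rounding point of the returned significand
are zero, matching the description above.
</P>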
594*9403c583SJens Wiklander
595*9403c583SJens Wiklander
596*9403c583SJens Wiklander<H2>7. Exceptions and Exception Flags</H2>
597*9403c583SJens Wiklander
598*9403c583SJens Wiklander<P>
599*9403c583SJens WiklanderAll five exception flags required by the IEEE Floating-Point Standard are
600*9403c583SJens Wiklanderimplemented.
601*9403c583SJens WiklanderEach flag is stored as a separate bit in the global variable
602*9403c583SJens Wiklander<BLOCKQUOTE>
603*9403c583SJens Wiklander<CODE>uint_fast8_t softfloat_exceptionFlags;</CODE>
604*9403c583SJens Wiklander</BLOCKQUOTE>
605*9403c583SJens WiklanderThe positions of the exception flag bits within this variable are determined by
606*9403c583SJens Wiklanderthe bit masks
607*9403c583SJens Wiklander<BLOCKQUOTE>
608*9403c583SJens Wiklander<CODE>softfloat_flag_inexact</CODE><BR>
609*9403c583SJens Wiklander<CODE>softfloat_flag_underflow</CODE><BR>
610*9403c583SJens Wiklander<CODE>softfloat_flag_overflow</CODE><BR>
611*9403c583SJens Wiklander<CODE>softfloat_flag_infinite</CODE><BR>
612*9403c583SJens Wiklander<CODE>softfloat_flag_invalid</CODE>
613*9403c583SJens Wiklander</BLOCKQUOTE>
614*9403c583SJens WiklanderVariable <CODE>softfloat_exceptionFlags</CODE> is initialized to all zeros,
615*9403c583SJens Wiklandermeaning no exceptions.
616*9403c583SJens Wiklander</P>
617*9403c583SJens Wiklander
618*9403c583SJens Wiklander<P>
619*9403c583SJens WiklanderAn individual exception flag can be cleared with the statement
620*9403c583SJens Wiklander<BLOCKQUOTE>
<CODE>softfloat_exceptionFlags &amp;= ~softfloat_flag_&lt;<I>exception</I>&gt;;</CODE>
622*9403c583SJens Wiklander</BLOCKQUOTE>
623*9403c583SJens Wiklanderwhere <CODE>&lt;<I>exception</I>&gt;</CODE> is the appropriate name.
624*9403c583SJens WiklanderTo raise a floating-point exception, function <CODE>softfloat_raise</CODE>
625*9403c583SJens Wiklandershould normally be used.
626*9403c583SJens Wiklander</P>
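<P>
The bit-mask pattern described above can be sketched in isolation.
The <CODE>demo_</CODE> variable, masks, and functions below are hypothetical
stand-ins so that the fragment compiles without the library; in a real program
the variable and masks come from <CODE>softfloat.h</CODE>, and the numeric
mask values shown here are assumptions of the sketch.
</P>

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in masks, one bit per exception (values are illustrative only). */
enum {
    demo_flag_inexact   = 1,
    demo_flag_underflow = 2,
    demo_flag_overflow  = 4,
    demo_flag_infinite  = 8,
    demo_flag_invalid   = 16
};

/* Stand-in for softfloat_exceptionFlags; all zeros means no exceptions. */
static uint_fast8_t demo_exceptionFlags = 0;

/* Raise one or more flags by OR-ing their masks into the variable. */
void demo_raise( uint_fast8_t flags )
{
    demo_exceptionFlags |= flags;
}

/* Report whether a flag was set, then clear only that flag. */
bool demo_test_and_clear( uint_fast8_t mask )
{
    bool wasSet = (demo_exceptionFlags & mask) != 0;
    demo_exceptionFlags &= (uint_fast8_t) ~mask;
    return wasSet;
}
```

<P>
Because the flags are sticky, a typical pattern is to clear the flags of
interest, perform a sequence of operations, and test the flags once at the
end.
</P>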
627*9403c583SJens Wiklander
628*9403c583SJens Wiklander<P>
629*9403c583SJens WiklanderWhen SoftFloat detects an exception other than <I>inexact</I>, it calls
630*9403c583SJens Wiklander<CODE>softfloat_raise</CODE>.
631*9403c583SJens WiklanderThe default version of this function simply raises the corresponding exception
632*9403c583SJens Wiklanderflags.
633*9403c583SJens WiklanderParticular ports of SoftFloat may support alternate behavior, such as exception
634*9403c583SJens Wiklandertraps, by modifying the default <CODE>softfloat_raise</CODE>.
635*9403c583SJens WiklanderA program may also supply its own <CODE>softfloat_raise</CODE> function to
636*9403c583SJens Wiklanderoverride the one from the SoftFloat library.
637*9403c583SJens Wiklander</P>
638*9403c583SJens Wiklander
639*9403c583SJens Wiklander<P>
640*9403c583SJens WiklanderBecause inexact results occur frequently under most circumstances (and thus are
641*9403c583SJens Wiklanderhardly exceptional), SoftFloat does not ordinarily call
642*9403c583SJens Wiklander<CODE>softfloat_raise</CODE> for <I>inexact</I> exceptions.
643*9403c583SJens WiklanderIt does always raise the <I>inexact</I> exception flag as required.
644*9403c583SJens Wiklander</P>
645*9403c583SJens Wiklander
646*9403c583SJens Wiklander
647*9403c583SJens Wiklander<H2>8. Function Details</H2>
648*9403c583SJens Wiklander
649*9403c583SJens Wiklander<P>
650*9403c583SJens WiklanderIn this section, <CODE>&lt;<I>float</I>&gt;</CODE> appears in function names as
651*9403c583SJens Wiklandera substitute for one of these abbreviations:
652*9403c583SJens Wiklander<BLOCKQUOTE>
653*9403c583SJens Wiklander<TABLE CELLSPACING=0 CELLPADDING=0>
654*9403c583SJens Wiklander<TR>
655*9403c583SJens Wiklander<TD><CODE>f32</CODE></TD>
656*9403c583SJens Wiklander<TD>indicates <CODE>float32_t</CODE>, passed by value</TD>
657*9403c583SJens Wiklander</TR>
658*9403c583SJens Wiklander<TR>
659*9403c583SJens Wiklander<TD><CODE>f64</CODE></TD>
660*9403c583SJens Wiklander<TD>indicates <CODE>float64_t</CODE>, passed by value</TD>
661*9403c583SJens Wiklander</TR>
662*9403c583SJens Wiklander<TR>
663*9403c583SJens Wiklander<TD><CODE>extF80M&nbsp;&nbsp;&nbsp;</CODE></TD>
664*9403c583SJens Wiklander<TD>indicates <CODE>extFloat80_t</CODE>, passed indirectly via pointers</TD>
665*9403c583SJens Wiklander</TR>
666*9403c583SJens Wiklander<TR>
667*9403c583SJens Wiklander<TD><CODE>extF80</CODE></TD>
668*9403c583SJens Wiklander<TD>indicates <CODE>extFloat80_t</CODE>, passed by value</TD>
669*9403c583SJens Wiklander</TR>
670*9403c583SJens Wiklander<TR>
671*9403c583SJens Wiklander<TD><CODE>f128M</CODE></TD>
672*9403c583SJens Wiklander<TD>indicates <CODE>float128_t</CODE>, passed indirectly via pointers</TD>
673*9403c583SJens Wiklander</TR>
674*9403c583SJens Wiklander<TR>
675*9403c583SJens Wiklander<TD><CODE>f128</CODE></TD>
676*9403c583SJens Wiklander<TD>indicates <CODE>float128_t</CODE>, passed by value</TD>
677*9403c583SJens Wiklander</TR>
678*9403c583SJens Wiklander</TABLE>
679*9403c583SJens Wiklander</BLOCKQUOTE>
680*9403c583SJens WiklanderThe circumstances under which values of floating-point types
681*9403c583SJens Wiklander<CODE>extFloat80_t</CODE> and <CODE>float128_t</CODE> may be passed either by
value or indirectly via pointers were discussed earlier in
683*9403c583SJens Wiklander<NOBR>section 4.5</NOBR>, <I>Conventions for Passing Arguments and Results</I>.
684*9403c583SJens Wiklander</P>
685*9403c583SJens Wiklander
686*9403c583SJens Wiklander<H3>8.1. Conversions from Integer to Floating-Point</H3>
687*9403c583SJens Wiklander
688*9403c583SJens Wiklander<P>
689*9403c583SJens WiklanderAll conversions from a <NOBR>32-bit</NOBR> or <NOBR>64-bit</NOBR> integer,
690*9403c583SJens Wiklandersigned or unsigned, to a floating-point format are supported.
691*9403c583SJens WiklanderFunctions performing these conversions have these names:
692*9403c583SJens Wiklander<BLOCKQUOTE>
693*9403c583SJens Wiklander<CODE>ui32_to_&lt;<I>float</I>&gt;</CODE><BR>
694*9403c583SJens Wiklander<CODE>ui64_to_&lt;<I>float</I>&gt;</CODE><BR>
695*9403c583SJens Wiklander<CODE>i32_to_&lt;<I>float</I>&gt;</CODE><BR>
696*9403c583SJens Wiklander<CODE>i64_to_&lt;<I>float</I>&gt;</CODE>
697*9403c583SJens Wiklander</BLOCKQUOTE>
Conversions from <NOBR>32-bit</NOBR> integers to <NOBR>64-bit</NOBR>
double-precision and larger formats are always exact.
Likewise, conversions from <NOBR>64-bit</NOBR> integers to <NOBR>80-bit</NOBR>
double-extended-precision and <NOBR>128-bit</NOBR> quadruple-precision are
always exact.
703*9403c583SJens Wiklander</P>
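<P>
Whether a particular integer converts exactly depends on whether its
significant bits, from the leading 1 to the trailing 1, fit in the significand
of the target format.
The helper below is a hypothetical sketch of that test, not part of SoftFloat;
it assumes significand widths of 24, 53, 64, and 113 bits for
<CODE>float32_t</CODE>, <CODE>float64_t</CODE>, <CODE>extFloat80_t</CODE>, and
<CODE>float128_t</CODE> respectively.
</P>

```c
#include <stdbool.h>
#include <stdint.h>

/* Return true if an integer of magnitude mag is exactly representable in a
   binary floating-point format with sigBits bits of significand precision. */
bool demo_int_is_exact( uint64_t mag, unsigned sigBits )
{
    unsigned width = 0;

    if ( mag == 0 ) return true;
    while ( (mag & 1) == 0 ) mag >>= 1;     /* trailing zeros cost nothing */
    while ( mag != 0 ) { ++width; mag >>= 1; }
    return width <= sigBits;
}
```

<P>
For example, 16777217 (2<SUP>24</SUP>&nbsp;+&nbsp;1) needs 25 significant
bits, so it is inexact in <CODE>float32_t</CODE> but exact in
<CODE>float64_t</CODE>.
Any <NOBR>64-bit</NOBR> integer needs at most 64 bits, which is why
conversions to the two largest formats are always exact.
</P>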
704*9403c583SJens Wiklander
705*9403c583SJens Wiklander<P>
706*9403c583SJens WiklanderEach conversion function takes one input of the appropriate type and generates
707*9403c583SJens Wiklanderone output.
708*9403c583SJens WiklanderThe following illustrates the signatures of these functions in cases when the
709*9403c583SJens Wiklanderfloating-point result is passed either by value or via pointers:
710*9403c583SJens Wiklander<BLOCKQUOTE>
711*9403c583SJens Wiklander<PRE>
712*9403c583SJens Wiklanderfloat64_t i32_to_f64( int32_t <I>a</I> );
713*9403c583SJens Wiklander</PRE>
714*9403c583SJens Wiklander<PRE>
715*9403c583SJens Wiklandervoid i32_to_f128M( int32_t <I>a</I>, float128_t *<I>destPtr</I> );
716*9403c583SJens Wiklander</PRE>
717*9403c583SJens Wiklander</BLOCKQUOTE>
718*9403c583SJens Wiklander</P>
719*9403c583SJens Wiklander
720*9403c583SJens Wiklander<H3>8.2. Conversions from Floating-Point to Integer</H3>
721*9403c583SJens Wiklander
722*9403c583SJens Wiklander<P>
723*9403c583SJens WiklanderConversions from a floating-point format to a <NOBR>32-bit</NOBR> or
724*9403c583SJens Wiklander<NOBR>64-bit</NOBR> integer, signed or unsigned, are supported with these
725*9403c583SJens Wiklanderfunctions:
726*9403c583SJens Wiklander<BLOCKQUOTE>
727*9403c583SJens Wiklander<CODE>&lt;<I>float</I>&gt;_to_ui32</CODE><BR>
728*9403c583SJens Wiklander<CODE>&lt;<I>float</I>&gt;_to_ui64</CODE><BR>
729*9403c583SJens Wiklander<CODE>&lt;<I>float</I>&gt;_to_i32</CODE><BR>
730*9403c583SJens Wiklander<CODE>&lt;<I>float</I>&gt;_to_i64</CODE>
731*9403c583SJens Wiklander</BLOCKQUOTE>
732*9403c583SJens WiklanderThe functions have signatures as follows, depending on whether the
733*9403c583SJens Wiklanderfloating-point input is passed by value or via pointers:
734*9403c583SJens Wiklander<BLOCKQUOTE>
735*9403c583SJens Wiklander<PRE>
736*9403c583SJens Wiklanderint_fast32_t f64_to_i32( float64_t <I>a</I>, uint_fast8_t <I>roundingMode</I>, bool <I>exact</I> );
737*9403c583SJens Wiklander</PRE>
738*9403c583SJens Wiklander<PRE>
739*9403c583SJens Wiklanderint_fast32_t
740*9403c583SJens Wiklander f128M_to_i32( const float128_t *<I>aPtr</I>, uint_fast8_t <I>roundingMode</I>, bool <I>exact</I> );
741*9403c583SJens Wiklander</PRE>
742*9403c583SJens Wiklander</BLOCKQUOTE>
743*9403c583SJens WiklanderThe <CODE><I>roundingMode</I></CODE> argument specifies the rounding mode for
744*9403c583SJens Wiklanderthe conversion.
745*9403c583SJens WiklanderThe variable that usually indicates rounding mode,
746*9403c583SJens Wiklander<CODE>softfloat_roundingMode</CODE>, is ignored.
747*9403c583SJens WiklanderArgument <CODE><I>exact</I></CODE> determines whether the <I>inexact</I>
748*9403c583SJens Wiklanderexception flag is raised if the conversion is not exact.
749*9403c583SJens WiklanderIf <CODE><I>exact</I></CODE> is <CODE>true</CODE>, the <I>inexact</I> flag may
750*9403c583SJens Wiklanderbe raised;
751*9403c583SJens Wiklanderotherwise, it will not be, even if the conversion is inexact.
752*9403c583SJens Wiklander</P>
753*9403c583SJens Wiklander
754*9403c583SJens Wiklander<P>
755*9403c583SJens WiklanderConversions from floating-point to integer raise the <I>invalid</I> exception
756*9403c583SJens Wiklanderif the source value cannot be rounded to a representable integer of the desired
757*9403c583SJens Wiklandersize (32 or 64 bits).
758*9403c583SJens WiklanderIn such a circumstance, if the floating-point input is a NaN or if the
759*9403c583SJens Wiklanderconversion is to an unsigned integer type, the largest positive integer is
760*9403c583SJens Wiklanderreturned;
761*9403c583SJens Wiklanderotherwise, the largest integer with the same sign as the input is returned.
762*9403c583SJens WiklanderThe functions that convert to integer types never raise the <I>overflow</I>
763*9403c583SJens Wiklanderexception.
764*9403c583SJens Wiklander</P>
765*9403c583SJens Wiklander
766*9403c583SJens Wiklander<P>
767*9403c583SJens WiklanderNote that, when converting to an unsigned integer type, if the <I>invalid</I>
768*9403c583SJens Wiklanderexception is raised because the input floating-point value would round to a
769*9403c583SJens Wiklandernegative integer, the value returned is the <EM>maximum positive unsigned
770*9403c583SJens Wiklanderinteger</EM>.
771*9403c583SJens WiklanderZero is not returned when the <I>invalid</I> exception is raised, even when
772*9403c583SJens Wiklanderzero is the closest integer to the original floating-point value.
773*9403c583SJens Wiklander</P>
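<P>
The results described above for the <I>invalid</I> case can be modeled with
native C types.
The sketch below is only a model of the documented behavior, not SoftFloat
code: it takes a native <CODE>double</CODE>, always rounds toward zero, and
records the exception in a hypothetical flag variable.
</P>

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sticky flag standing in for the invalid exception. */
static bool demo_invalid_raised = false;

/* Model of a conversion to uint32_t, rounding toward zero. */
uint32_t demo_to_ui32_r_minMag( double a )
{
    /* NaN, a value that rounds to a negative integer, or a value too
       large: raise invalid and return the largest positive integer. */
    if ( a != a || a <= -1.0 || a >= 4294967296.0 ) {
        demo_invalid_raised = true;
        return UINT32_MAX;
    }
    return (uint32_t) a;        /* the cast truncates toward zero */
}

/* Model of a conversion to int32_t, rounding toward zero. */
int32_t demo_to_i32_r_minMag( double a )
{
    if ( a != a || a >= 2147483648.0 ) {    /* NaN or too large */
        demo_invalid_raised = true;
        return INT32_MAX;
    }
    if ( a <= -2147483649.0 ) {             /* too negative */
        demo_invalid_raised = true;
        return INT32_MIN;
    }
    return (int32_t) a;
}
```

<P>
In this model, &minus;1.5 converted to an unsigned integer yields the maximum
value and sets the flag, whereas &minus;0.5 truncates to zero without an
<I>invalid</I> exception.
</P>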
774*9403c583SJens Wiklander
775*9403c583SJens Wiklander<P>
776*9403c583SJens WiklanderBecause languages such <NOBR>as C</NOBR> require that conversions to integers
777*9403c583SJens Wiklanderbe rounded toward zero, the following functions are provided for improved speed
778*9403c583SJens Wiklanderand convenience:
779*9403c583SJens Wiklander<BLOCKQUOTE>
780*9403c583SJens Wiklander<CODE>&lt;<I>float</I>&gt;_to_ui32_r_minMag</CODE><BR>
781*9403c583SJens Wiklander<CODE>&lt;<I>float</I>&gt;_to_ui64_r_minMag</CODE><BR>
782*9403c583SJens Wiklander<CODE>&lt;<I>float</I>&gt;_to_i32_r_minMag</CODE><BR>
783*9403c583SJens Wiklander<CODE>&lt;<I>float</I>&gt;_to_i64_r_minMag</CODE>
784*9403c583SJens Wiklander</BLOCKQUOTE>
785*9403c583SJens WiklanderThese functions round only toward zero (to minimum magnitude).
786*9403c583SJens WiklanderThe signatures for these functions are the same as above without the redundant
787*9403c583SJens Wiklander<CODE><I>roundingMode</I></CODE> argument:
788*9403c583SJens Wiklander<BLOCKQUOTE>
789*9403c583SJens Wiklander<PRE>
790*9403c583SJens Wiklanderint_fast32_t f64_to_i32_r_minMag( float64_t <I>a</I>, bool <I>exact</I> );
791*9403c583SJens Wiklander</PRE>
792*9403c583SJens Wiklander<PRE>
793*9403c583SJens Wiklanderint_fast32_t f128M_to_i32_r_minMag( const float128_t *<I>aPtr</I>, bool <I>exact</I> );
794*9403c583SJens Wiklander</PRE>
795*9403c583SJens Wiklander</BLOCKQUOTE>
796*9403c583SJens Wiklander</P>
797*9403c583SJens Wiklander
798*9403c583SJens Wiklander<H3>8.3. Conversions Among Floating-Point Types</H3>
799*9403c583SJens Wiklander
800*9403c583SJens Wiklander<P>
801*9403c583SJens WiklanderConversions between floating-point formats are done by functions with these
802*9403c583SJens Wiklandernames:
803*9403c583SJens Wiklander<BLOCKQUOTE>
804*9403c583SJens Wiklander<CODE>&lt;<I>float</I>&gt;_to_&lt;<I>float</I>&gt;</CODE>
805*9403c583SJens Wiklander</BLOCKQUOTE>
806*9403c583SJens WiklanderAll combinations of source and result type are supported where the source and
807*9403c583SJens Wiklanderresult are different formats.
808*9403c583SJens WiklanderThere are four different styles of signature for these functions, depending on
809*9403c583SJens Wiklanderwhether the input and the output floating-point values are passed by value or
810*9403c583SJens Wiklandervia pointers:
811*9403c583SJens Wiklander<BLOCKQUOTE>
812*9403c583SJens Wiklander<PRE>
813*9403c583SJens Wiklanderfloat32_t f64_to_f32( float64_t <I>a</I> );
814*9403c583SJens Wiklander</PRE>
815*9403c583SJens Wiklander<PRE>
816*9403c583SJens Wiklanderfloat32_t f128M_to_f32( const float128_t *<I>aPtr</I> );
817*9403c583SJens Wiklander</PRE>
818*9403c583SJens Wiklander<PRE>
819*9403c583SJens Wiklandervoid f32_to_f128M( float32_t <I>a</I>, float128_t *<I>destPtr</I> );
820*9403c583SJens Wiklander</PRE>
821*9403c583SJens Wiklander<PRE>
822*9403c583SJens Wiklandervoid extF80M_to_f128M( const extFloat80_t *<I>aPtr</I>, float128_t *<I>destPtr</I> );
823*9403c583SJens Wiklander</PRE>
824*9403c583SJens Wiklander</BLOCKQUOTE>
825*9403c583SJens Wiklander</P>
826*9403c583SJens Wiklander
827*9403c583SJens Wiklander<P>
828*9403c583SJens WiklanderConversions from a smaller to a larger floating-point format are always exact
829*9403c583SJens Wiklanderand so require no rounding.
830*9403c583SJens Wiklander</P>
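<P>
Why widening is exact can be seen at the bit level: every significand bit of
the smaller format has a place in the larger one, and the exponent range of
the larger format covers that of the smaller.
The following sketch widens a <NOBR>32-bit</NOBR> single-precision encoding to
a <NOBR>64-bit</NOBR> double-precision encoding using integer operations only.
It is a hypothetical illustration, not SoftFloat&rsquo;s implementation, and it
simply copies NaN payloads rather than treating signaling NaNs specially.
</P>

```c
#include <stdint.h>

/* Widen an IEEE single-precision bit pattern to double precision. */
uint64_t demo_f32_bits_to_f64_bits( uint32_t a )
{
    uint64_t sign = (uint64_t) (a >> 31) << 63;
    int32_t  exp  = (int32_t) ((a >> 23) & 0xFF);
    uint64_t frac = a & 0x7FFFFF;

    if ( exp == 0xFF ) {                /* infinity or NaN */
        return sign | UINT64_C( 0x7FF0000000000000 ) | (frac << 29);
    }
    if ( exp == 0 ) {
        if ( frac == 0 ) return sign;   /* signed zero */
        /* Subnormal input: normalize; it is a normal number in double. */
        exp = 1;
        while ( (frac & 0x800000) == 0 ) { frac <<= 1; --exp; }
        frac &= 0x7FFFFF;               /* drop the now-implicit leading 1 */
    }
    /* Rebias the exponent (1023 - 127 = 896) and pad the 23-bit fraction
       with 29 zero bits to fill the 52-bit field. No bits are lost. */
    return sign | ((uint64_t) (exp + 896) << 52) | (frac << 29);
}
```

<P>
For instance, the encoding <CODE>0x3F800000</CODE> (1.0) becomes
<CODE>0x3FF0000000000000</CODE>, and <CODE>0xC0200000</CODE> (&minus;2.5)
becomes <CODE>0xC004000000000000</CODE>.
</P>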
831*9403c583SJens Wiklander
832*9403c583SJens Wiklander<H3>8.4. Basic Arithmetic Functions</H3>
833*9403c583SJens Wiklander
834*9403c583SJens Wiklander<P>
835*9403c583SJens WiklanderThe following basic arithmetic functions are provided:
836*9403c583SJens Wiklander<BLOCKQUOTE>
837*9403c583SJens Wiklander<CODE>&lt;<I>float</I>&gt;_add</CODE><BR>
838*9403c583SJens Wiklander<CODE>&lt;<I>float</I>&gt;_sub</CODE><BR>
839*9403c583SJens Wiklander<CODE>&lt;<I>float</I>&gt;_mul</CODE><BR>
840*9403c583SJens Wiklander<CODE>&lt;<I>float</I>&gt;_div</CODE><BR>
841*9403c583SJens Wiklander<CODE>&lt;<I>float</I>&gt;_sqrt</CODE>
842*9403c583SJens Wiklander</BLOCKQUOTE>
Each floating-point operation takes two operands, except for <CODE>sqrt</CODE>
(square root), which takes only one.
845*9403c583SJens WiklanderThe operands and result are all of the same floating-point format.
846*9403c583SJens WiklanderSignatures for these functions take the following forms:
847*9403c583SJens Wiklander<BLOCKQUOTE>
848*9403c583SJens Wiklander<PRE>
849*9403c583SJens Wiklanderfloat64_t f64_add( float64_t <I>a</I>, float64_t <I>b</I> );
850*9403c583SJens Wiklander</PRE>
851*9403c583SJens Wiklander<PRE>
852*9403c583SJens Wiklandervoid
853*9403c583SJens Wiklander f128M_add(
854*9403c583SJens Wiklander     const float128_t *<I>aPtr</I>, const float128_t *<I>bPtr</I>, float128_t *<I>destPtr</I> );
855*9403c583SJens Wiklander</PRE>
856*9403c583SJens Wiklander<PRE>
857*9403c583SJens Wiklanderfloat64_t f64_sqrt( float64_t <I>a</I> );
858*9403c583SJens Wiklander</PRE>
859*9403c583SJens Wiklander<PRE>
860*9403c583SJens Wiklandervoid f128M_sqrt( const float128_t *<I>aPtr</I>, float128_t *<I>destPtr</I> );
861*9403c583SJens Wiklander</PRE>
862*9403c583SJens Wiklander</BLOCKQUOTE>
863*9403c583SJens WiklanderWhen floating-point values are passed indirectly through pointers, arguments
864*9403c583SJens Wiklander<CODE><I>aPtr</I></CODE> and <CODE><I>bPtr</I></CODE> point to the input
865*9403c583SJens Wiklanderoperands, and the last argument, <CODE><I>destPtr</I></CODE>, points to the
866*9403c583SJens Wiklanderlocation where the result is stored.
867*9403c583SJens Wiklander</P>
868*9403c583SJens Wiklander
869*9403c583SJens Wiklander<P>
870*9403c583SJens WiklanderRounding of the <NOBR>80-bit</NOBR> double-extended-precision
871*9403c583SJens Wiklander(<CODE>extFloat80_t</CODE>) functions is affected by variable
872*9403c583SJens Wiklander<CODE>extF80_roundingPrecision</CODE>, as explained earlier in
873*9403c583SJens Wiklander<NOBR>section 6.3</NOBR>,
874*9403c583SJens Wiklander<I>Rounding Precision for the <NOBR>80-Bit</NOBR> Extended Format</I>.
875*9403c583SJens Wiklander</P>
876*9403c583SJens Wiklander
877*9403c583SJens Wiklander<H3>8.5. Fused Multiply-Add Functions</H3>
878*9403c583SJens Wiklander
879*9403c583SJens Wiklander<P>
880*9403c583SJens WiklanderThe 2008 version of the IEEE Floating-Point Standard defines a <I>fused
881*9403c583SJens Wiklandermultiply-add</I> operation that does a combined multiplication and addition
882*9403c583SJens Wiklanderwith only a single rounding.
883*9403c583SJens WiklanderSoftFloat implements fused multiply-add with functions
884*9403c583SJens Wiklander<BLOCKQUOTE>
885*9403c583SJens Wiklander<CODE>&lt;<I>float</I>&gt;_mulAdd</CODE>
886*9403c583SJens Wiklander</BLOCKQUOTE>
Unlike other operations, fused multiply-add is supported only for the
non-extended formats, <CODE>float32_t</CODE>, <CODE>float64_t</CODE>, and
<CODE>float128_t</CODE>.
No fused multiply-add function is currently provided for the
891*9403c583SJens Wiklander<NOBR>80-bit</NOBR> double-extended-precision type, <CODE>extFloat80_t</CODE>.
892*9403c583SJens Wiklander</P>
893*9403c583SJens Wiklander
894*9403c583SJens Wiklander<P>
895*9403c583SJens WiklanderDepending on whether floating-point values are passed by value or via pointers,
896*9403c583SJens Wiklanderthe fused multiply-add functions have signatures of these forms:
897*9403c583SJens Wiklander<BLOCKQUOTE>
898*9403c583SJens Wiklander<PRE>
899*9403c583SJens Wiklanderfloat64_t f64_mulAdd( float64_t <I>a</I>, float64_t <I>b</I>, float64_t <I>c</I> );
900*9403c583SJens Wiklander</PRE>
901*9403c583SJens Wiklander<PRE>
902*9403c583SJens Wiklandervoid
903*9403c583SJens Wiklander f128M_mulAdd(
904*9403c583SJens Wiklander     const float128_t *<I>aPtr</I>,
905*9403c583SJens Wiklander     const float128_t *<I>bPtr</I>,
906*9403c583SJens Wiklander     const float128_t *<I>cPtr</I>,
907*9403c583SJens Wiklander     float128_t *<I>destPtr</I>
908*9403c583SJens Wiklander );
909*9403c583SJens Wiklander</PRE>
910*9403c583SJens Wiklander</BLOCKQUOTE>
911*9403c583SJens WiklanderThe functions compute
912*9403c583SJens Wiklander<NOBR>(<CODE><I>a</I></CODE> &times; <CODE><I>b</I></CODE>)
913*9403c583SJens Wiklander + <CODE><I>c</I></CODE></NOBR>
914*9403c583SJens Wiklanderwith a single rounding.
915*9403c583SJens WiklanderWhen floating-point values are passed indirectly through pointers, arguments
916*9403c583SJens Wiklander<CODE><I>aPtr</I></CODE>, <CODE><I>bPtr</I></CODE>, and
917*9403c583SJens Wiklander<CODE><I>cPtr</I></CODE> point to operands <CODE><I>a</I></CODE>,
918*9403c583SJens Wiklander<CODE><I>b</I></CODE>, and <CODE><I>c</I></CODE> respectively, and
919*9403c583SJens Wiklander<CODE><I>destPtr</I></CODE> points to the location where the result is stored.
920*9403c583SJens Wiklander</P>
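<P>
The practical difference between one rounding and two can be shown with a toy
model in which values are integers rounded to a small number of significant
bits.
The functions below are hypothetical and are not SoftFloat code; they assume
round to nearest even and operands small enough that the exact product and sum
fit in 64 bits.
</P>

```c
#include <stdint.h>

/* Round x to at most p significant bits, nearest with ties to even. */
int64_t demo_round_to_bits( int64_t x, unsigned p )
{
    int neg = x < 0;
    uint64_t mag = neg ? (uint64_t) 0 - (uint64_t) x : (uint64_t) x;
    uint64_t t;
    unsigned width = 0;

    for ( t = mag; t != 0; t >>= 1 ) ++width;
    if ( width > p ) {
        uint64_t unit = (uint64_t) 1 << (width - p);
        uint64_t rem = mag & (unit - 1);
        mag -= rem;
        if ( rem > unit / 2 || (rem == unit / 2 && (mag & unit)) ) {
            mag += unit;
        }
    }
    return neg ? -(int64_t) mag : (int64_t) mag;
}

/* Separate multiply and add: the product is rounded, then the sum. */
int64_t demo_mul_then_add( int64_t a, int64_t b, int64_t c, unsigned p )
{
    return demo_round_to_bits( demo_round_to_bits( a * b, p ) + c, p );
}

/* Fused multiply-add: the exact value a*b + c is rounded once. */
int64_t demo_fused_mul_add( int64_t a, int64_t b, int64_t c, unsigned p )
{
    return demo_round_to_bits( a * b + c, p );
}
```

<P>
With 4 bits of precision and operands 15, 15, and &minus;224, the exact
product 225 rounds to 224, so the separate operations return 0, while the
fused form returns the exact answer 1.
</P>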
921*9403c583SJens Wiklander
922*9403c583SJens Wiklander<P>
923*9403c583SJens WiklanderIf one of the multiplication operands <CODE><I>a</I></CODE> and
924*9403c583SJens Wiklander<CODE><I>b</I></CODE> is infinite and the other is zero, these functions raise
the <I>invalid</I> exception even if operand <CODE><I>c</I></CODE> is a quiet NaN.
926*9403c583SJens Wiklander</P>
927*9403c583SJens Wiklander
928*9403c583SJens Wiklander<H3>8.6. Remainder Functions</H3>
929*9403c583SJens Wiklander
930*9403c583SJens Wiklander<P>
931*9403c583SJens WiklanderFor each format, SoftFloat implements the remainder operation defined by the
932*9403c583SJens WiklanderIEEE Floating-Point Standard.
933*9403c583SJens WiklanderThe remainder functions have names
934*9403c583SJens Wiklander<BLOCKQUOTE>
935*9403c583SJens Wiklander<CODE>&lt;<I>float</I>&gt;_rem</CODE>
936*9403c583SJens Wiklander</BLOCKQUOTE>
937*9403c583SJens WiklanderEach remainder operation takes two floating-point operands of the same format
938*9403c583SJens Wiklanderand returns a result in the same format.
939*9403c583SJens WiklanderDepending on whether floating-point values are passed by value or via pointers,
940*9403c583SJens Wiklanderthe remainder functions have signatures of these forms:
941*9403c583SJens Wiklander<BLOCKQUOTE>
942*9403c583SJens Wiklander<PRE>
943*9403c583SJens Wiklanderfloat64_t f64_rem( float64_t <I>a</I>, float64_t <I>b</I> );
944*9403c583SJens Wiklander</PRE>
945*9403c583SJens Wiklander<PRE>
946*9403c583SJens Wiklandervoid
947*9403c583SJens Wiklander f128M_rem(
948*9403c583SJens Wiklander     const float128_t *<I>aPtr</I>, const float128_t *<I>bPtr</I>, float128_t *<I>destPtr</I> );
949*9403c583SJens Wiklander</PRE>
950*9403c583SJens Wiklander</BLOCKQUOTE>
951*9403c583SJens WiklanderWhen floating-point values are passed indirectly through pointers, arguments
952*9403c583SJens Wiklander<CODE><I>aPtr</I></CODE> and <CODE><I>bPtr</I></CODE> point to operands
953*9403c583SJens Wiklander<CODE><I>a</I></CODE> and <CODE><I>b</I></CODE> respectively, and
954*9403c583SJens Wiklander<CODE><I>destPtr</I></CODE> points to the location where the result is stored.
955*9403c583SJens Wiklander</P>
956*9403c583SJens Wiklander
957*9403c583SJens Wiklander<P>
958*9403c583SJens WiklanderThe IEEE Standard remainder operation computes the value
959*9403c583SJens Wiklander<NOBR><CODE><I>a</I></CODE>
960*9403c583SJens Wiklander &minus; <I>n</I> &times; <CODE><I>b</I></CODE></NOBR>,
961*9403c583SJens Wiklanderwhere <I>n</I> is the integer closest to
962*9403c583SJens Wiklander<NOBR><CODE><I>a</I></CODE> &divide; <CODE><I>b</I></CODE></NOBR>.
963*9403c583SJens WiklanderIf <NOBR><CODE><I>a</I></CODE> &divide; <CODE><I>b</I></CODE></NOBR> is exactly
964*9403c583SJens Wiklanderhalfway between two integers, <I>n</I> is the <EM>even</EM> integer closest to
965*9403c583SJens Wiklander<NOBR><CODE><I>a</I></CODE> &divide; <CODE><I>b</I></CODE></NOBR>.
966*9403c583SJens WiklanderThe IEEE Standard&rsquo;s remainder operation is always exact and so requires
967*9403c583SJens Wiklanderno rounding.
968*9403c583SJens Wiklander</P>
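<P>
The choice of <I>n</I> is what distinguishes this operation from the
truncating remainder of C&rsquo;s <CODE>%</CODE> operator.
The sketch below applies the same definition to integers so that every step is
exact and easy to check; it is a hypothetical illustration, not one of the
SoftFloat functions, and it assumes a nonzero divisor and arguments other than
<CODE>INT64_MIN</CODE>.
</P>

```c
#include <stdint.h>

/* IEEE-style remainder for integers: a - n*b, where n is a/b rounded to the
   nearest integer, with ties going to the even integer. */
int64_t demo_ieee_rem( int64_t a, int64_t b )
{
    int64_t q = a / b;          /* quotient truncated toward zero */
    int64_t r = a % b;          /* a - q*b; has the sign of a */
    uint64_t absR = (uint64_t) (r < 0 ? -r : r);
    uint64_t absB = (uint64_t) (b < 0 ? -b : b);
    uint64_t rest = absB - absR;    /* distance to the next multiple of b */

    /* Step to the neighboring multiple of b if it is nearer, or, on an
       exact tie, if the truncated quotient is odd. */
    if ( absR > rest || (absR == rest && q % 2 != 0) ) {
        r = ((r < 0) == (b < 0)) ? r - b : r + b;
    }
    return r;
}
```

<P>
For example, 5 and 2 give 1 (<I>n</I> = 2, the even neighbor of 2.5), whereas
7 and 2 give &minus;1 (<I>n</I> = 4, the even neighbor of 3.5).
Unlike the result of <CODE>%</CODE>, the result can be negative even when both
operands are positive.
</P>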
969*9403c583SJens Wiklander
970*9403c583SJens Wiklander<P>
971*9403c583SJens WiklanderDepending on the relative magnitudes of the operands, the remainder
972*9403c583SJens Wiklanderfunctions can take considerably longer to execute than the other SoftFloat
973*9403c583SJens Wiklanderfunctions.
974*9403c583SJens WiklanderThis is inherent in the remainder operation itself and is not a flaw in the
975*9403c583SJens WiklanderSoftFloat implementation.
976*9403c583SJens Wiklander</P>
977*9403c583SJens Wiklander
978*9403c583SJens Wiklander<H3>8.7. Round-to-Integer Functions</H3>
979*9403c583SJens Wiklander
980*9403c583SJens Wiklander<P>
981*9403c583SJens WiklanderFor each format, SoftFloat implements the round-to-integer operation specified
982*9403c583SJens Wiklanderby the IEEE Floating-Point Standard.
983*9403c583SJens WiklanderThese functions are named
984*9403c583SJens Wiklander<BLOCKQUOTE>
985*9403c583SJens Wiklander<CODE>&lt;<I>float</I>&gt;_roundToInt</CODE>
986*9403c583SJens Wiklander</BLOCKQUOTE>
987*9403c583SJens WiklanderEach round-to-integer operation takes a single floating-point operand.
988*9403c583SJens WiklanderThis operand is rounded to an integer according to a specified rounding mode,
989*9403c583SJens Wiklanderand the resulting integer value is returned in the same floating-point format.
990*9403c583SJens Wiklander(Note that the result is not an integer type.)
991*9403c583SJens Wiklander</P>
992*9403c583SJens Wiklander
993*9403c583SJens Wiklander<P>
994*9403c583SJens WiklanderThe signatures of the round-to-integer functions are similar to those for
995*9403c583SJens Wiklanderconversions to an integer type:
996*9403c583SJens Wiklander<BLOCKQUOTE>
997*9403c583SJens Wiklander<PRE>
998*9403c583SJens Wiklanderfloat64_t f64_roundToInt( float64_t <I>a</I>, uint_fast8_t <I>roundingMode</I>, bool <I>exact</I> );
999*9403c583SJens Wiklander</PRE>
1000*9403c583SJens Wiklander<PRE>
1001*9403c583SJens Wiklandervoid
1002*9403c583SJens Wiklander f128M_roundToInt(
1003*9403c583SJens Wiklander     const float128_t *<I>aPtr</I>,
1004*9403c583SJens Wiklander     uint_fast8_t <I>roundingMode</I>,
1005*9403c583SJens Wiklander     bool <I>exact</I>,
1006*9403c583SJens Wiklander     float128_t *<I>destPtr</I>
1007*9403c583SJens Wiklander );
1008*9403c583SJens Wiklander</PRE>
1009*9403c583SJens Wiklander</BLOCKQUOTE>
1010*9403c583SJens WiklanderThe <CODE><I>roundingMode</I></CODE> argument specifies the rounding mode to
1011*9403c583SJens Wiklanderapply.
1012*9403c583SJens WiklanderThe variable that usually indicates rounding mode,
1013*9403c583SJens Wiklander<CODE>softfloat_roundingMode</CODE>, is ignored.
Argument <CODE><I>exact</I></CODE> determines whether the <I>inexact</I>
exception flag is raised if the rounding is not exact.
If <CODE><I>exact</I></CODE> is <CODE>true</CODE>, the <I>inexact</I> flag may
be raised;
otherwise, it will not be, even if the rounding is inexact.
1019*9403c583SJens WiklanderWhen floating-point values are passed indirectly through pointers,
1020*9403c583SJens Wiklander<CODE><I>aPtr</I></CODE> points to the input operand and
1021*9403c583SJens Wiklander<CODE><I>destPtr</I></CODE> points to the location where the result is stored.
1022*9403c583SJens Wiklander</P>
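<P>
That the result remains a floating-point value, and the role of the
<CODE><I>exact</I></CODE> argument, can be sketched at the bit level for single
precision.
The function below is a hypothetical model restricted to rounding toward zero;
it is not SoftFloat code, it records <I>inexact</I> in a stand-in flag
variable, and it passes NaNs through unchanged rather than handling signaling
NaNs.
</P>

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sticky flag standing in for the inexact exception. */
static bool demo_inexact_raised = false;

/* Round a single-precision bit pattern to an integral value, toward zero.
   The result is again a single-precision bit pattern. */
uint32_t demo_f32_roundToInt_minMag( uint32_t a, bool exact )
{
    uint32_t exp = (a >> 23) & 0xFF;
    uint32_t result;

    if ( exp >= 150 ) return a;     /* already integral, or infinity/NaN */
    if ( exp < 127 ) {
        result = a & UINT32_C( 0x80000000 );    /* |a| < 1: signed zero */
    } else {
        /* Clear the fraction bits that lie below the binary point. */
        result = a & ~((UINT32_C( 1 ) << (150 - exp)) - 1);
    }
    if ( exact && result != a ) demo_inexact_raised = true;
    return result;
}
```

<P>
Here <CODE>0x40200000</CODE> (2.5) becomes <CODE>0x40000000</CODE> (2.0), and
the stand-in flag is set only when <CODE><I>exact</I></CODE> is
<CODE>true</CODE>.
</P>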

<H3>8.8. Comparison Functions</H3>

<P>
For each format, the following floating-point comparison functions are
provided:
<BLOCKQUOTE>
<CODE>&lt;<I>float</I>&gt;_eq</CODE><BR>
<CODE>&lt;<I>float</I>&gt;_le</CODE><BR>
<CODE>&lt;<I>float</I>&gt;_lt</CODE>
</BLOCKQUOTE>
Each comparison takes two operands of the same type and returns a Boolean.
The abbreviation <CODE>eq</CODE> stands for &ldquo;equal&rdquo; (=);
<CODE>le</CODE> stands for &ldquo;less than or equal&rdquo; (&le;);
and <CODE>lt</CODE> stands for &ldquo;less than&rdquo; (&lt;).
Depending on whether the floating-point operands are passed by value or via
pointers, the comparison functions have signatures of these forms:
<BLOCKQUOTE>
<PRE>
bool f64_eq( float64_t <I>a</I>, float64_t <I>b</I> );
</PRE>
<PRE>
bool f128M_eq( const float128_t *<I>aPtr</I>, const float128_t *<I>bPtr</I> );
</PRE>
</BLOCKQUOTE>
</P>

<P>
The usual greater-than (&gt;), greater-than-or-equal (&ge;), and not-equal
(&ne;) comparisons are easily obtained from the functions provided.
The not-equal function is just the logical complement of the equal function.
The greater-than-or-equal function is identical to the less-than-or-equal
function with the arguments in reverse order, and likewise the greater-than
function is identical to the less-than function with the arguments reversed.
</P>
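<P>
The derivations just described can be sketched in C as follows.
To keep the example buildable without <CODE>softfloat.h</CODE>, it supplies
stand-ins for <CODE>float64_t</CODE>, <CODE>f64_eq</CODE>,
<CODE>f64_le</CODE>, and <CODE>f64_lt</CODE> that use the host's
<CODE>double</CODE> (assumed to be IEEE binary64) and do not model exception
flags.
The names <CODE>f64_ne</CODE>, <CODE>f64_ge</CODE>, <CODE>f64_gt</CODE>,
<CODE>to_host</CODE>, and <CODE>from_host</CODE> are hypothetical and are not
SoftFloat functions.
</P>

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for SoftFloat's float64_t so the sketch builds on its own. */
typedef struct { uint64_t v; } float64_t;

/* Helpers for the stand-ins: reinterpret the bits as a host double. */
static double to_host(float64_t a)
{
    double d;
    memcpy(&d, &a.v, sizeof d);
    return d;
}

float64_t from_host(double d)
{
    float64_t a;
    memcpy(&a.v, &d, sizeof a.v);
    return a;
}

/* Stand-ins for the library's comparison functions (no exception flags). */
static bool f64_eq(float64_t a, float64_t b) { return to_host(a) == to_host(b); }
static bool f64_le(float64_t a, float64_t b) { return to_host(a) <= to_host(b); }
static bool f64_lt(float64_t a, float64_t b) { return to_host(a) <  to_host(b); }

/* Derived comparisons, written exactly as the text describes. */
bool f64_ne(float64_t a, float64_t b) { return !f64_eq(a, b); }  /* complement of eq */
bool f64_ge(float64_t a, float64_t b) { return f64_le(b, a); }   /* le, operands swapped */
bool f64_gt(float64_t a, float64_t b) { return f64_lt(b, a); }   /* lt, operands swapped */
```

<P>
One consequence worth noting: when either operand is a NaN, the derived
not-equal comparison returns <CODE>true</CODE>, whereas the derived
greater-than and greater-than-or-equal comparisons return <CODE>false</CODE>.
</P>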

<P>
The IEEE Floating-Point Standard specifies that the less-than-or-equal and
less-than comparisons by default raise the <I>invalid</I> exception if either
operand is any kind of NaN.
Equality comparisons, on the other hand, are defined by default to raise the
<I>invalid</I> exception only for signaling NaNs, not quiet NaNs.
For completeness, SoftFloat provides these complementary functions:
<BLOCKQUOTE>
<CODE>&lt;<I>float</I>&gt;_eq_signaling</CODE><BR>
<CODE>&lt;<I>float</I>&gt;_le_quiet</CODE><BR>
<CODE>&lt;<I>float</I>&gt;_lt_quiet</CODE>
</BLOCKQUOTE>
The <CODE>signaling</CODE> equality comparisons are identical to the default
equality comparisons except that the <I>invalid</I> exception is raised for any
NaN input, not just for signaling NaNs.
Similarly, the <CODE>quiet</CODE> comparison functions are identical to their
default counterparts except that the <I>invalid</I> exception is not raised for
quiet NaNs.
</P>
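<P>
The two rules for raising <I>invalid</I> can be summarized by a small model.
This is a sketch under stated assumptions, not library code: the
<CODE>model_</CODE> names are hypothetical, <CODE>float64_t</CODE> is a
stand-in type, and a signaling NaN is assumed to be a NaN whose most
significant fraction bit is 0, which is the common encoding but is ultimately
determined by each port of SoftFloat.
</P>

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for SoftFloat's float64_t so the sketch builds on its own. */
typedef struct { uint64_t v; } float64_t;

/* A NaN has an all-ones exponent and a nonzero fraction. */
static bool model_isNaN(float64_t a)
{
    return (a.v & UINT64_C(0x7FFFFFFFFFFFFFFF)) > UINT64_C(0x7FF0000000000000);
}

/* Assumption: a signaling NaN has the top fraction bit (bit 51) clear. */
static bool model_isSignalingNaN(float64_t a)
{
    return model_isNaN(a) && !(a.v & UINT64_C(0x0008000000000000));
}

/* Rule followed by _eq, _le_quiet, and _lt_quiet:
   invalid is raised only for signaling NaNs. */
bool model_quietCompareIsInvalid(float64_t a, float64_t b)
{
    return model_isSignalingNaN(a) || model_isSignalingNaN(b);
}

/* Rule followed by _le, _lt, and _eq_signaling:
   invalid is raised for any NaN operand. */
bool model_signalingCompareIsInvalid(float64_t a, float64_t b)
{
    return model_isNaN(a) || model_isNaN(b);
}
```

<P>
For a quiet NaN operand such as bit pattern <CODE>0x7FF8000000000000</CODE>,
the first rule reports no exception while the second does.
For a signaling NaN such as <CODE>0x7FF0000000000001</CODE> (under the
assumption above), both rules report <I>invalid</I>.
</P>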

<H3>8.9. Signaling NaN Test Functions</H3>

<P>
Functions for testing whether a floating-point value is a signaling NaN are
provided with these names:
<BLOCKQUOTE>
<CODE>&lt;<I>float</I>&gt;_isSignalingNaN</CODE>
</BLOCKQUOTE>
The functions take one floating-point operand and return a Boolean indicating
whether the operand is a signaling NaN.
Accordingly, the functions have the forms
<BLOCKQUOTE>
<PRE>
bool f64_isSignalingNaN( float64_t <I>a</I> );
</PRE>
<PRE>
bool f128M_isSignalingNaN( const float128_t *<I>aPtr</I> );
</PRE>
</BLOCKQUOTE>
</P>

<H3>8.10. Raise-Exception Function</H3>

<P>
SoftFloat provides a single function for raising floating-point exceptions:
<BLOCKQUOTE>
<PRE>
void softfloat_raise( uint_fast8_t <I>exceptions</I> );
</PRE>
</BLOCKQUOTE>
The <CODE><I>exceptions</I></CODE> argument is a mask indicating the set of
exceptions to raise.
(See earlier section 7, <I>Exceptions and Exception Flags</I>.)
In addition to setting the specified exception flags in variable
<CODE>softfloat_exceptionFlags</CODE>, the <CODE>softfloat_raise</CODE>
function may cause a trap or abort appropriate for the current system.
</P>
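<P>
The flag-setting part of this behavior can be modeled in a few lines of C.
The sketch below is an illustration only: the <CODE>model_</CODE> names and the
flag values are hypothetical stand-ins, real code should use
<CODE>softfloat_raise</CODE>, <CODE>softfloat_exceptionFlags</CODE>, and the
<CODE>softfloat_flag_*</CODE> constants from <CODE>softfloat.h</CODE>, and the
optional trap or abort is not modeled.
</P>

```c
#include <stdint.h>

/* Hypothetical flag values for this sketch only. */
enum { model_flag_inexact = 1, model_flag_invalid = 16 };

/* Stand-in for the variable softfloat_exceptionFlags. */
uint_fast8_t model_exceptionFlags = 0;

/* Minimal model of raising exceptions: OR the mask into the flags.
   Flags already set stay set; no trap is taken. */
void model_raise(uint_fast8_t exceptions)
{
    model_exceptionFlags |= exceptions;
}

/* Report whether any flag in 'mask' is set, then clear those flags. */
int model_testAndClear(uint_fast8_t mask)
{
    int wasSet = (model_exceptionFlags & mask) != 0;
    model_exceptionFlags &= (uint_fast8_t)~mask;
    return wasSet;
}
```

<P>
A caller following this pattern would typically clear the flags of interest,
perform a sequence of operations, and then test the flags once at the end.
</P>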


<H2>9. Changes from SoftFloat <NOBR>Release 2</NOBR></H2>

<P>
Apart from a change in the legal use license, <NOBR>Release 3</NOBR> of
SoftFloat introduced numerous technical differences compared to earlier
releases.
</P>

<H3>9.1. Name Changes</H3>

<P>
The most obvious and pervasive difference compared to <NOBR>Release 2</NOBR>
is that the names of most functions and variables have changed, even when the
behavior has not.
First, the floating-point types, the mode variables, the exception flags
variable, the function to raise exceptions, and various associated constants
have been renamed as follows:
<BLOCKQUOTE>
<TABLE>
<TR>
<TD>old name, Release 2:</TD>
<TD>new name, Release 3:</TD>
</TR>
<TR>
<TD><CODE>float32</CODE></TD>
<TD><CODE>float32_t</CODE></TD>
</TR>
<TR>
<TD><CODE>float64</CODE></TD>
<TD><CODE>float64_t</CODE></TD>
</TR>
<TR>
<TD><CODE>floatx80</CODE></TD>
<TD><CODE>extFloat80_t</CODE></TD>
</TR>
<TR>
<TD><CODE>float128</CODE></TD>
<TD><CODE>float128_t</CODE></TD>
</TR>
<TR>
<TD><CODE>float_rounding_mode</CODE></TD>
<TD><CODE>softfloat_roundingMode</CODE></TD>
</TR>
<TR>
<TD><CODE>float_round_nearest_even</CODE></TD>
<TD><CODE>softfloat_round_near_even</CODE></TD>
</TR>
<TR>
<TD><CODE>float_round_to_zero</CODE></TD>
<TD><CODE>softfloat_round_minMag</CODE></TD>
</TR>
<TR>
<TD><CODE>float_round_down</CODE></TD>
<TD><CODE>softfloat_round_min</CODE></TD>
</TR>
<TR>
<TD><CODE>float_round_up</CODE></TD>
<TD><CODE>softfloat_round_max</CODE></TD>
</TR>
<TR>
<TD><CODE>float_detect_tininess</CODE></TD>
<TD><CODE>softfloat_detectTininess</CODE></TD>
</TR>
<TR>
<TD><CODE>float_tininess_before_rounding&nbsp;&nbsp;&nbsp;&nbsp;</CODE></TD>
<TD><CODE>softfloat_tininess_beforeRounding</CODE></TD>
</TR>
<TR>
<TD><CODE>float_tininess_after_rounding</CODE></TD>
<TD><CODE>softfloat_tininess_afterRounding</CODE></TD>
</TR>
<TR>
<TD><CODE>floatx80_rounding_precision</CODE></TD>
<TD><CODE>extF80_roundingPrecision</CODE></TD>
</TR>
<TR>
<TD><CODE>float_exception_flags</CODE></TD>
<TD><CODE>softfloat_exceptionFlags</CODE></TD>
</TR>
<TR>
<TD><CODE>float_flag_inexact</CODE></TD>
<TD><CODE>softfloat_flag_inexact</CODE></TD>
</TR>
<TR>
<TD><CODE>float_flag_underflow</CODE></TD>
<TD><CODE>softfloat_flag_underflow</CODE></TD>
</TR>
<TR>
<TD><CODE>float_flag_overflow</CODE></TD>
<TD><CODE>softfloat_flag_overflow</CODE></TD>
</TR>
<TR>
<TD><CODE>float_flag_divbyzero</CODE></TD>
<TD><CODE>softfloat_flag_infinite</CODE></TD>
</TR>
<TR>
<TD><CODE>float_flag_invalid</CODE></TD>
<TD><CODE>softfloat_flag_invalid</CODE></TD>
</TR>
<TR>
<TD><CODE>float_raise</CODE></TD>
<TD><CODE>softfloat_raise</CODE></TD>
</TR>
</TABLE>
</BLOCKQUOTE>
</P>

<P>
Furthermore, <NOBR>Release 3</NOBR> adopted the following new abbreviations for
function names:
<BLOCKQUOTE>
<TABLE>
<TR>
<TD>used in names in Release 2:<CODE>&nbsp;&nbsp;&nbsp;&nbsp;</CODE></TD>
<TD>used in names in Release 3:</TD>
</TR>
<TR> <TD><CODE>int32</CODE></TD>    <TD><CODE>i32</CODE></TD>    </TR>
<TR> <TD><CODE>int64</CODE></TD>    <TD><CODE>i64</CODE></TD>    </TR>
<TR> <TD><CODE>float32</CODE></TD>  <TD><CODE>f32</CODE></TD>    </TR>
<TR> <TD><CODE>float64</CODE></TD>  <TD><CODE>f64</CODE></TD>    </TR>
<TR> <TD><CODE>floatx80</CODE></TD> <TD><CODE>extF80</CODE></TD> </TR>
<TR> <TD><CODE>float128</CODE></TD> <TD><CODE>f128</CODE></TD>   </TR>
</TABLE>
</BLOCKQUOTE>
Thus, for example, the function to add two <NOBR>32-bit</NOBR> floating-point
numbers, previously called <CODE>float32_add</CODE> in <NOBR>Release 2</NOBR>,
is now <CODE>f32_add</CODE>.
Lastly, there have been a few other changes to function names:
<BLOCKQUOTE>
<TABLE>
<TR>
<TD>used in names in Release 2:<CODE>&nbsp;&nbsp;&nbsp;</CODE></TD>
<TD>used in names in Release 3:<CODE>&nbsp;&nbsp;&nbsp;</CODE></TD>
<TD>relevant functions:</TD>
</TR>
<TR>
<TD><CODE>_round_to_zero</CODE></TD>
<TD><CODE>_r_minMag</CODE></TD>
<TD>conversions from floating-point to integer (<NOBR>section 8.2</NOBR>)</TD>
</TR>
<TR>
<TD><CODE>round_to_int</CODE></TD>
<TD><CODE>roundToInt</CODE></TD>
<TD>round-to-integer functions (<NOBR>section 8.7</NOBR>)</TD>
</TR>
<TR>
<TD><CODE>is_signaling_nan&nbsp;&nbsp;&nbsp;&nbsp;</CODE></TD>
<TD><CODE>isSignalingNaN</CODE></TD>
<TD>signaling NaN test functions (<NOBR>section 8.9</NOBR>)</TD>
</TR>
</TABLE>
</BLOCKQUOTE>
</P>

<H3>9.2. Changes to Function Arguments</H3>

<P>
Besides simple name changes, some operations were given a different interface
in <NOBR>Release 3</NOBR> than they had in <NOBR>Release 2</NOBR>:
<UL>

<LI>
<P>
Since <NOBR>Release 3</NOBR>, integer arguments and results of functions have
standard types from header <CODE>&lt;stdint.h&gt;</CODE>, such as
<CODE>uint32_t</CODE>, whereas previously their types could be defined
differently for each port of SoftFloat, usually using traditional C types such
as <CODE>unsigned</CODE> <CODE>int</CODE>.
Likewise, functions in <NOBR>Release 3</NOBR> and later pass Booleans as
standard type <CODE>bool</CODE> from <CODE>&lt;stdbool.h&gt;</CODE>, whereas
previously these were again passed as a port-specific type (usually
<CODE>int</CODE>).
</P>

<LI>
<P>
As explained earlier in <NOBR>section 4.5</NOBR>, <I>Conventions for Passing
Arguments and Results</I>, SoftFloat functions in <NOBR>Release 3</NOBR> and
later may pass <NOBR>80-bit</NOBR> and <NOBR>128-bit</NOBR> floating-point
values through pointers, meaning that functions take pointer arguments and then
read or write floating-point values at the locations indicated by the pointers.
In <NOBR>Release 2</NOBR>, floating-point arguments and results were always
passed by value, regardless of their size.
</P>

<LI>
<P>
Functions that round to an integer have additional
<CODE><I>roundingMode</I></CODE> and <CODE><I>exact</I></CODE> arguments that
they did not have in <NOBR>Release 2</NOBR>.
Refer to sections 8.2 <NOBR>and 8.7</NOBR> for descriptions of these functions
since <NOBR>Release 3</NOBR>.
For <NOBR>Release 2</NOBR>, the rounding mode, when needed, was taken from the
same global variable that affects the basic arithmetic operations (now called
<CODE>softfloat_roundingMode</CODE> but previously known as
<CODE>float_rounding_mode</CODE>).
Also, for <NOBR>Release 2</NOBR>, if the original floating-point input was not
an exact integer value, and if the <I>invalid</I> exception was not raised by
the function, the <I>inexact</I> exception was always raised.
<NOBR>Release 2</NOBR> had no option to suppress raising <I>inexact</I> in this
case.
Applications using SoftFloat <NOBR>Release 3</NOBR> or later can get the same
effect as <NOBR>Release 2</NOBR> by passing variable
<CODE>softfloat_roundingMode</CODE> for argument
<CODE><I>roundingMode</I></CODE> and <CODE>true</CODE> for argument
<CODE><I>exact</I></CODE>.
</P>

</UL>
</P>
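<P>
The last point above suggests a simple compatibility wrapper for code being
migrated from <NOBR>Release 2</NOBR>.
The sketch below is hypothetical: the wrapper name
<CODE>compat_float64_round_to_int</CODE> is invented for illustration, and so
that the example builds without <CODE>softfloat.h</CODE>, it supplies stand-ins
for <CODE>float64_t</CODE> and <CODE>softfloat_roundingMode</CODE> plus a stub
<CODE>f64_roundToInt</CODE> that merely records the arguments it receives.
In real use, the declarations from <CODE>softfloat.h</CODE> would replace the
stand-ins and only the wrapper would remain.
</P>

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins so the sketch builds on its own. */
typedef struct { uint64_t v; } float64_t;
uint_fast8_t softfloat_roundingMode = 0;

/* Stub in place of the library function: it records its arguments
   and returns the input unchanged (no rounding is performed). */
uint_fast8_t stub_lastMode = 0;
bool stub_lastExact = false;

static float64_t f64_roundToInt(float64_t a, uint_fast8_t roundingMode, bool exact)
{
    stub_lastMode = roundingMode;
    stub_lastExact = exact;
    return a;
}

/* Hypothetical wrapper with a Release-2-style signature: the rounding
   mode comes from the global variable, and inexact may always be raised. */
float64_t compat_float64_round_to_int(float64_t a)
{
    return f64_roundToInt(a, softfloat_roundingMode, true);
}
```

<P>
Because the wrapper reads <CODE>softfloat_roundingMode</CODE> at each call, a
change to that variable takes effect on the next call, just as it did for the
<NOBR>Release 2</NOBR> functions.
</P>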

<H3>9.3. Added Capabilities</H3>

<P>
With <NOBR>Release 3</NOBR>, some new features have been added that were not
present in <NOBR>Release 2</NOBR>:
<UL>

<LI>
<P>
A port of SoftFloat can now define any of the floating-point types
<CODE>float32_t</CODE>, <CODE>float64_t</CODE>, <CODE>extFloat80_t</CODE>, and
<CODE>float128_t</CODE> as aliases for C&rsquo;s standard floating-point types
<CODE>float</CODE>, <CODE>double</CODE>, and <CODE>long</CODE>
<CODE>double</CODE>, using either <CODE>#define</CODE> or <CODE>typedef</CODE>.
This potential convenience was not supported under <NOBR>Release 2</NOBR>.
</P>

<P>
(Note, however, that there may be a performance cost to defining
SoftFloat&rsquo;s floating-point types this way, depending on the platform and
the applications using SoftFloat.
Ports of SoftFloat may choose to forgo the convenience in favor of better
speed.)
</P>

<LI>
<P>
Functions have been added for converting between the floating-point types and
unsigned integers.
<NOBR>Release 2</NOBR> supported only signed integers, not unsigned.
</P>

<LI>
<P>
A new, fifth rounding mode, <CODE>softfloat_round_near_maxMag</CODE> (round to
nearest, with ties to maximum magnitude, away from zero) is now supported for
all cases involving rounding.
</P>

<LI>
<P>
Fused multiply-add functions have been added for the non-extended formats,
<CODE>float32_t</CODE>, <CODE>float64_t</CODE>, and <CODE>float128_t</CODE>.
</P>

</UL>
</P>

<H3>9.4. Better Compatibility with the C Language</H3>

<P>
<NOBR>Release 3</NOBR> of SoftFloat was written to conform better to the ISO C
Standard&rsquo;s rules for portability.
For example, older releases of SoftFloat employed type conversions in ways
that, while commonly practiced, are not fully defined by the C Standard.
Such problematic type conversions have generally been replaced by the use of
unions, whose behavior is more strictly regulated these days.
</P>

<H3>9.5. New Organization as a Library</H3>

<P>
Starting with <NOBR>Release 3</NOBR>, SoftFloat builds as a library.
Previously, SoftFloat compiled into a single, monolithic object file containing
all the SoftFloat functions, with the consequence that a program linking with
SoftFloat would get every SoftFloat function in its binary file even if only a
few functions were actually used.
With SoftFloat in the form of a library, a program that is linked by a standard
linker will include only those functions of SoftFloat that it needs and no
others.
</P>

<H3>9.6. Optimization Gains (and Losses)</H3>

<P>
Individual SoftFloat functions have been variously improved in
<NOBR>Release 3</NOBR> compared to earlier releases.
In particular, better, faster algorithms have been deployed for the operations
of division, square root, and remainder.
For functions operating on the larger <NOBR>80-bit</NOBR> and
<NOBR>128-bit</NOBR> formats, <CODE>extFloat80_t</CODE> and
<CODE>float128_t</CODE>, code size has also generally been reduced.
</P>

<P>
However, because <NOBR>Release 2</NOBR> compiled all of SoftFloat together as a
single object file, compilers could make optimizations across function calls
when one SoftFloat function calls another.
Now that the functions of SoftFloat are compiled separately and only afterward
linked together into a program, there is not usually the same opportunity to
optimize across function calls.
Some loss of speed has been observed due to this change.
</P>


<H2>10. Future Directions</H2>

<P>
The following improvements are anticipated for future releases of SoftFloat:
<UL>
<LI>
support for the common <NOBR>16-bit</NOBR> &ldquo;half-precision&rdquo;
floating-point format;
<LI>
more functions from the 2008 version of the IEEE Floating-Point Standard;
<LI>
consistent, defined behavior for non-canonical representations of extended
format <CODE>extFloat80_t</CODE> (discussed in <NOBR>section 4.4</NOBR>,
<I>Non-canonical Representations in <CODE>extFloat80_t</CODE></I>).

</UL>
</P>


<H2>11. Contact Information</H2>

<P>
At the time of this writing, the most up-to-date information about SoftFloat
and the latest release can be found at the Web page
<A HREF="http://www.jhauser.us/arithmetic/SoftFloat.html"><CODE>http://www.jhauser.us/arithmetic/SoftFloat.html</CODE></A>.
</P>


</BODY>
