= 4.3.2 (20131002) =

* Fixed a bug in which short Unicode input was improperly encoded to
  ASCII when checking whether or not it was the name of a file on
  disk. [bug=1227016]

* Fixed a crash when a short input contains data not valid in
  filenames. [bug=1232604]

* Fixed a bug that caused Unicode data put into UnicodeDammit to
  return None instead of the original data. [bug=1214983]

* Combined two tests to stop a spurious test failure when tests are
  run by nosetests. [bug=1212445]

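A rough sketch of the UnicodeDammit behavior described above, for
illustration only. The sample string is made up, and the example
assumes the bs4 package is importable:

```python
from bs4 import UnicodeDammit

# Input that is already Unicode should come back as the original data,
# not as None.
text = u"Sacr\xe9 bleu!"
dammit = UnicodeDammit(text)
result = dammit.unicode_markup
print(result)
```

Nothing needs to be detected here, since the input was never a
bytestring to begin with.
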
= 4.3.1 (20130815) =

* Fixed yet another problem with the html5lib tree builder, caused by
  html5lib's tendency to rearrange the tree during
  parsing. [bug=1189267]

* Fixed a bug that caused the optimized version of find_all() to
  return nothing. [bug=1212655]

= 4.3.0 (20130812) =

* Instead of converting incoming data to Unicode and feeding it to the
  lxml tree builder in chunks, Beautiful Soup now makes successive
  guesses at the encoding of the incoming data, and tells lxml to
  parse the data as that encoding. Giving lxml more control over the
  parsing process improves performance and avoids a number of bugs and
  issues with the lxml parser which had previously required elaborate
  workarounds:

  - An issue in which lxml refuses to parse Unicode strings on some
    systems. [bug=1180527]

  - A returning bug that truncated documents longer than a (very
    small) size. [bug=963880]

  - A returning bug in which extra spaces were added to a document if
    the document defined a charset other than UTF-8. [bug=972466]

  This required a major overhaul of the tree builder architecture. If
  you wrote your own tree builder and didn't tell me, you'll need to
  modify your prepare_markup() method.

* The UnicodeDammit code that makes guesses at encodings has been
  split into its own class, EncodingDetector. A lot of apparently
  redundant code has been removed from Unicode, Dammit, and some
  undocumented features have also been removed.

* Beautiful Soup will issue a warning if instead of markup you pass it
  a URL or the name of a file on disk (a common beginner's mistake).

* A number of optimizations improve the performance of the lxml tree
  builder by about 33%, the html.parser tree builder by about 20%, and
  the html5lib tree builder by about 15%.

* All find_all calls should now return a ResultSet object. Patch by
  Aaron DeVore. [bug=1194034]

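To illustrate the ResultSet change, here is a small sketch. The markup
is invented, and the example assumes the html.parser tree builder and
that ResultSet can be imported from bs4.element:

```python
from bs4 import BeautifulSoup
from bs4.element import ResultSet

soup = BeautifulSoup("<p>one</p><p>two</p>", "html.parser")

# find_all() hands back a ResultSet, which behaves like a list.
paragraphs = soup.find_all("p")
texts = [p.get_text() for p in paragraphs]
print(type(paragraphs).__name__, texts)
```

Because ResultSet is a list subclass, code that only iterates over or
indexes the result should not need to change.
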
= 4.2.1 (20130531) =

* The default XML formatter will now replace ampersands even if they
  appear to be part of entities. That is, "&lt;" will become
  "&amp;lt;". The old code was left over from Beautiful Soup 3, which
  didn't always turn entities into Unicode characters.

  If you really want the old behavior (maybe because you add new
  strings to the tree, those strings include entities, and you want
  the formatter to leave them alone on output), it can be found in
  EntitySubstitution.substitute_xml_containing_entities(). [bug=1182183]

* Gave new_string() the ability to create subclasses of
  NavigableString. [bug=1181986]

* Fixed another bug by which the html5lib tree builder could create a
  disconnected tree. [bug=1182089]

* The .previous_element of a BeautifulSoup object is now always None,
  not the last element to be parsed. [bug=1182089]

* Fixed test failures when lxml is not installed. [bug=1181589]

* html5lib now supports Python 3. Fixed some Python 2-specific
  code in the html5lib test suite. [bug=1181624]

* The html.parser treebuilder can now handle numeric attributes in
  text when the hexadecimal name of the attribute starts with a
  capital X. Patch by Tim Shirley. [bug=1186242]

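The difference between the two substitution methods can be sketched
as follows. This is only an illustration: the sample string is made
up, and it assumes both methods are available as classmethods of
EntitySubstitution in bs4.dammit:

```python
from bs4.dammit import EntitySubstitution

# A made-up string with one bare ampersand and two existing entities.
sample = "AT&T &lt;b&gt;"

# Default behavior: every ampersand is escaped, entity or not.
escaped_all = EntitySubstitution.substitute_xml(sample)

# Old behavior: anything that already looks like an entity is left alone.
kept_entities = EntitySubstitution.substitute_xml_containing_entities(sample)

print(escaped_all)    # AT&amp;T &amp;lt;b&amp;gt;
print(kept_entities)  # AT&amp;T &lt;b&gt;
```

The bare ampersand in "AT&T" is escaped either way; only the handling
of the existing entities differs.
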
= 4.2.0 (20130514) =

* The Tag.select() method now supports a much wider variety of CSS
  selectors.

  - Added support for the adjacent sibling combinator (+) and the
    general sibling combinator (~). Tests by "liquider". [bug=1082144]

  - The combinators (>, +, and ~) can now combine with any supported
    selector, not just one that selects based on tag name.

  - Added limited support for the "nth-of-type" pseudo-class. Code
    by Sven Slootweg. [bug=1109952]

* The BeautifulSoup class is now aliased to "_s" and "_soup", making
  it quicker to type the import statement in an interactive session:

  from bs4 import _s
   or
  from bs4 import _soup

  The alias may change in the future, so don't use this in code you're
  going to run more than once.

* Added the 'diagnose' submodule, which includes several useful
  functions for reporting problems and doing tech support.

  - diagnose(data) tries the given markup on every installed parser,
    reporting exceptions and displaying successes. If a parser is not
    installed, diagnose() mentions this fact.

  - lxml_trace(data, html=True) runs the given markup through lxml's
    XML parser or HTML parser, and prints out the parser events as
    they happen. This helps you quickly determine whether a given
    problem occurs in lxml code or Beautiful Soup code.

  - htmlparser_trace(data) is the same thing, but for Python's
    built-in HTMLParser class.

* In an HTML document, the contents of a <script> or <style> tag will
  no longer undergo entity substitution by default. XML documents work
  the same way they did before. [bug=1085953]

* Methods like get_text() and properties like .strings now only give
  you strings that are visible in the document--no comments or
  processing instructions. [bug=1050164]

* The prettify() method now leaves the contents of <pre> tags
  alone. [bug=1095654]

* Fixed a bug in the html5lib treebuilder which sometimes created
  disconnected trees. [bug=1039527]

* Fixed a bug in the lxml treebuilder which crashed when a tag included
  an attribute from the predefined "xml:" namespace. [bug=1065617]

* Fixed a bug by which keyword arguments to find_parent() were not
  being passed on. [bug=1126734]

* Stopped a crash when unwisely messing with a tag that's been
  decomposed. [bug=1097699]

* Now that lxml's segfault on invalid doctype has been fixed, fixed a
  corresponding problem on the Beautiful Soup end that was previously
  invisible. [bug=984936]

* Fixed an exception when an overspecified CSS selector didn't match
  anything. Code by Stefaan Lippens. [bug=1168167]

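The new selector support can be sketched roughly like this. The
markup and id values are invented for the example, and the
html.parser tree builder is assumed:

```python
from bs4 import BeautifulSoup

markup = '<div><p id="one">1</p><p id="two">2</p><p id="three">3</p></div>'
soup = BeautifulSoup(markup, "html.parser")

# Adjacent sibling combinator: the <p> immediately after #one.
adjacent = [tag["id"] for tag in soup.select("#one + p")]

# General sibling combinator: every later <p> sibling of #one.
siblings = [tag["id"] for tag in soup.select("#one ~ p")]

# Child combinator combined with a non-tag-name selector.
child = [tag["id"] for tag in soup.select("div > #two")]

# nth-of-type pseudo-class: the third <p> among its siblings.
third = [tag["id"] for tag in soup.select("p:nth-of-type(3)")]

print(adjacent, siblings, child, third)
```
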
= 4.1.3 (20120820) =

* Skipped a test under Python 2.6 and Python 3.1 to avoid a spurious
  test failure caused by the lousy HTMLParser in those
  versions. [bug=1038503]

* Raise a more specific error (FeatureNotFound) when a requested
  parser or parser feature is not installed. Raise NotImplementedError
  instead of ValueError when the user calls insert_before() or
  insert_after() on the BeautifulSoup object itself. Patch by Aaron
  DeVore. [bug=1038301]

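A minimal sketch of both error cases. The parser name
"no-such-parser" is deliberately made up, and the example assumes
FeatureNotFound can be imported from the top-level bs4 package:

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Asking for a parser feature that isn't installed.
try:
    BeautifulSoup("<p>hi</p>", "no-such-parser")
    parser_outcome = "parsed"
except FeatureNotFound:
    parser_outcome = "feature not found"

# Calling insert_before() on the BeautifulSoup object itself.
soup = BeautifulSoup("<p>hi</p>", "html.parser")
try:
    soup.insert_before("text")
    insert_outcome = "inserted"
except NotImplementedError:
    insert_outcome = "not implemented"

print(parser_outcome, insert_outcome)
```
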
= 4.1.2 (20120817) =

* As per PEP-8, allow searching by CSS class using the 'class_'
  keyword argument. [bug=1037624]

* Display namespace prefixes for namespaced attribute names, instead of
  the fully-qualified names given by the lxml parser. [bug=1037597]

* Fixed a crash on encoding when an attribute name contained
  non-ASCII characters.

* When sniffing encodings, if the cchardet library is installed,
  Beautiful Soup uses it instead of chardet. cchardet is much
  faster. [bug=1020748]

* Use logging.warning() instead of warnings.warn() to notify the user
  that characters were replaced with REPLACEMENT
  CHARACTER. [bug=1013862]

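The 'class_' keyword can be sketched like this. The markup is
invented for the example and the html.parser tree builder is assumed:

```python
from bs4 import BeautifulSoup

markup = '<p class="body strikeout">old</p><p class="body">new</p>'
soup = BeautifulSoup(markup, "html.parser")

# 'class' is a reserved word in Python, hence the trailing underscore.
struck = [p.get_text() for p in soup.find_all("p", class_="strikeout")]
body = [p.get_text() for p in soup.find_all(class_="body")]

print(struck, body)
```

A tag matches if any one of its CSS classes matches, so the first
paragraph shows up in both searches.
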
= 4.1.1 (20120703) =

* Fixed an html5lib tree builder crash which happened when html5lib
  moved a tag with a multivalued attribute from one part of the tree
  to another. [bug=1019603]

* Correctly display closing tags with an XML namespace declared. Patch
  by Andreas Kostyrka. [bug=1019635]

* Fixed a typo that made parsing significantly slower than it should
  have been, and also waited too long to close tags with XML
  namespaces. [bug=1020268]

* get_text() now returns an empty Unicode string if there is no text,
  rather than an empty bytestring. [bug=1020387]

= 4.1.0 (20120529) =

* Added experimental support for fixing Windows-1252 characters
  embedded in UTF-8 documents. (UnicodeDammit.detwingle())

* Fixed the handling of &quot; with the built-in parser. [bug=993871]

* Comments, processing instructions, document type declarations, and
  markup declarations are now treated as preformatted strings, the way
  CData blocks are. [bug=1001025]

* Fixed a bug with the lxml treebuilder that prevented the user from
  adding attributes to a tag that didn't originally have
  attributes. [bug=1002378] Thanks to Oliver Beattie for the patch.

* Fixed some edge-case bugs having to do with inserting an element
  into a tag it's already inside, and replacing one of a tag's
  children with another. [bug=997529]

* Added the ability to search for attribute values specified in
  UTF-8. [bug=1003974]

  This caused a major refactoring of the search code. All the tests
  pass, but it's possible that some searches will behave differently.

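A short sketch of what UnicodeDammit.detwingle() is for. The
mixed-encoding document is constructed by hand here purely for
illustration:

```python
from bs4 import UnicodeDammit

# Build a document that is mostly UTF-8 but contains Windows-1252
# smart quotes, the kind of mess detwingle() is meant to clean up.
snowmen = u"\N{SNOWMAN}" * 3
quote = u"\N{LEFT DOUBLE QUOTATION MARK}I like snowmen!\N{RIGHT DOUBLE QUOTATION MARK}"
doc = snowmen.encode("utf8") + quote.encode("windows_1252")

# detwingle() re-encodes the embedded Windows-1252 bytes as UTF-8,
# so the whole document can be decoded in one pass.
fixed = UnicodeDammit.detwingle(doc)
decoded = fixed.decode("utf8")
print(decoded)
```

Call detwingle() on the raw bytes before handing them to
BeautifulSoup or UnicodeDammit; it works on bytestrings, not Unicode.
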
= 4.0.5 (20120427) =

* Added a new method, wrap(), which wraps an element in a tag.

* Renamed replace_with_children() to unwrap(), which is easier to
  understand and also the jQuery name of the function.

* Made encoding substitution in <meta> tags completely transparent (no
  more %SOUP-ENCODING%).

* Fixed a bug in decoding data that contained a byte-order mark, such
  as data encoded in UTF-16LE. [bug=988980]

* Fixed a bug that made the HTMLParser treebuilder generate XML
  definitions ending with two question marks instead of
  one. [bug=984258]

* Upon document generation, CData objects are no longer run through
  the formatter. [bug=988905]

* The test suite now passes when lxml is not installed, whether or not
  html5lib is installed. [bug=987004]

* Print a warning on HTMLParseErrors to let people know they should
  install a better parser library.

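wrap() and unwrap() are opposites, which a small sketch makes
clearer. The markup is invented and the html.parser tree builder is
assumed:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>I wish I was bold.</p>", "html.parser")

# wrap() puts the element inside the tag you give it.
soup.p.string.wrap(soup.new_tag("b"))
wrapped = str(soup)

# unwrap() removes the tag again but keeps its contents in place.
soup.b.unwrap()
unwrapped = str(soup)

print(wrapped)    # <p><b>I wish I was bold.</b></p>
print(unwrapped)  # <p>I wish I was bold.</p>
```
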
= 4.0.4 (20120416) =

* Fixed a bug that sometimes created disconnected trees.

* Fixed a bug with the string setter that moved a string around the
  tree instead of copying it. [bug=983050]

* Attribute values are now run through the provided output formatter.
  Previously they were always run through the 'minimal' formatter. In
  the future I may make it possible to specify different formatters
  for attribute values and strings, but for now, consistent behavior
  is better than inconsistent behavior. [bug=980237]

* Added the missing renderContents method from Beautiful Soup 3. Also
  added an encode_contents() method to go along with decode_contents().

* Give a more useful error when the user tries to run the Python 2
  version of BS under Python 3.

* UnicodeDammit can now convert Microsoft smart quotes to ASCII with
  UnicodeDammit(markup, smart_quotes_to="ascii").

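A sketch of the smart-quote conversion. The markup is invented, and
the encoding is passed in explicitly so that no guessing is involved:

```python
from bs4 import UnicodeDammit

# Windows-1252 bytes containing Microsoft smart quotes.
markup = b"<p>I just \x93love\x94 Microsoft Word\x92s smart quotes</p>"

# Tell UnicodeDammit the encoding and ask for plain ASCII quotes.
dammit = UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="ascii")
converted = dammit.unicode_markup
print(converted)
```
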
= 4.0.3 (20120403) =

* Fixed a typo that caused some versions of Python 3 to convert the
  Beautiful Soup codebase incorrectly.

* Got rid of the 4.0.2 workaround for HTML documents--it was
  unnecessary and the workaround was triggering a (possibly different,
  but related) bug in lxml. [bug=972466]

= 4.0.2 (20120326) =

* Worked around a possible bug in lxml that prevents non-tiny XML
  documents from being parsed. [bug=963880, bug=963936]

* Fixed a bug where specifying `text` while also searching for a tag
  only worked if `text` wanted an exact string match. [bug=955942]

= 4.0.1 (20120314) =

* This is the first official release of Beautiful Soup 4. There is no
  4.0.0 release, to eliminate any possibility that packaging software
  might treat "4.0.0" as being an earlier version than "4.0.0b10".

* Brought BS up to date with the latest release of soupselect, adding
  CSS selector support for direct descendant matches and multiple CSS
  class matches.

= 4.0.0b10 (20120302) =

* Added support for simple CSS selectors, taken from the soupselect project.

* Fixed a crash when using html5lib. [bug=943246]

* In HTML5-style <meta charset="foo"> tags, the value of the "charset"
  attribute is now replaced with the appropriate encoding on
  output. [bug=942714]

* Fixed a bug that caused calling a tag to sometimes call find_all()
  with the wrong arguments. [bug=944426]

* For backwards compatibility, brought back the BeautifulStoneSoup
  class as a deprecated wrapper around BeautifulSoup.

= 4.0.0b9 (20120228) =

* Fixed the string representation of DOCTYPEs that have both a public
  ID and a system ID.

* Fixed the generated XML declaration.

* Renamed Tag.nsprefix to Tag.prefix, for consistency with
  NamespacedAttribute.

* Fixed a test failure that occurred on Python 3.x when chardet was
  installed.

* Made prettify() return Unicode by default, so it will look nice on
  Python 3 when passed into print().

= 4.0.0b8 (20120224) =

* All tree builders now preserve namespace information in the
  documents they parse. If you use the html5lib parser or lxml's XML
  parser, you can access the namespace URL for a tag as tag.namespace.

  However, there is no special support for namespace-oriented
  searching or tree manipulation. When you search the tree, you need
  to use namespace prefixes exactly as they're used in the original
  document.

* The string representation of a DOCTYPE always ends in a newline.

* Issue a warning if the user tries to use a SoupStrainer in
  conjunction with the html5lib tree builder, which doesn't support
  them.

= 4.0.0b7 (20120223) =

* Upon decoding to string, any characters that can't be represented in
  your chosen encoding will be converted into numeric XML entity
  references.

* Issue a warning if characters were replaced with REPLACEMENT
  CHARACTER during Unicode conversion.

* Restored compatibility with Python 2.6.

* The install process no longer installs docs or auxiliary text files.

* It's now possible to deepcopy a BeautifulSoup object created with
  Python's built-in HTML parser.

* About 100 unit tests that "test" the behavior of various parsers on
  invalid markup have been removed. Legitimate changes to those
  parsers caused these tests to fail, indicating that perhaps
  Beautiful Soup should not test the behavior of foreign
  libraries.

  The problematic unit tests have been reformulated as informational
  comparisons generated by the script
  scripts/demonstrate_parser_differences.py.

  This makes Beautiful Soup compatible with html5lib version 0.95 and
  future versions of HTMLParser.

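The numeric entity fallback can be sketched as follows. The snowman
character is just a convenient example of something ASCII can't
represent, and the html.parser tree builder is assumed:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(u"<b>\N{SNOWMAN}</b>", "html.parser")

# ASCII has no snowman, so it comes out as a numeric XML entity
# reference instead of raising an error or being dropped.
encoded = soup.b.encode("ascii")
print(encoded)
```
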
= 4.0.0b6 (20120216) =

* Multi-valued attributes like "class" always have a list of values,
  even if there's only one value in the list.

* Added a number of multi-valued attributes defined in HTML5.

* Stopped generating a space before the slash that closes an
  empty-element tag. This may come back if I add a special XHTML mode
  (http://www.w3.org/TR/xhtml1/#C_2), but right now it's pretty
  useless.

* Passing text along with tag-specific arguments to a find* method:

   find("a", text="Click here")

  will find tags that contain the given text as their
  .string. Previously, the tag-specific arguments were ignored and
  only strings were searched.

* Fixed a bug that caused the html5lib tree builder to build a
  partially disconnected tree. Generally cleaned up the html5lib tree
  builder.

* If you restrict a multi-valued attribute like "class" to a string
  that contains spaces, Beautiful Soup will only consider it a match
  if the values correspond to that specific string.

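A small sketch of the multi-valued attribute and text-search
behavior. The markup is invented and the html.parser tree builder is
assumed. The 'text' argument is spelled as it is in this release;
later releases may prefer a different name for it and may warn about
this one:

```python
from bs4 import BeautifulSoup

markup = '<a class="button" href="/go">Click here</a><a href="/skip">Skip</a>'
soup = BeautifulSoup(markup, "html.parser")

# "class" is multi-valued, so you get a list even for a single value.
classes = soup.a["class"]

# Tag name and text together: find the <a> whose .string matches.
link = soup.find("a", text="Click here")
href = link["href"]

print(classes, href)
```
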
= 4.0.0b5 (20120209) =

* Rationalized Beautiful Soup's treatment of CSS class. A tag
  belonging to multiple CSS classes is treated as having a list of
  values for the 'class' attribute. Searching for a CSS class will
  match *any* of the CSS classes.

  This actually affects all attributes that the HTML standard defines
  as taking multiple values (class, rel, rev, archive, accept-charset,
  and headers), but 'class' is by far the most common. [bug=41034]

* If you pass anything other than a dictionary as the second argument
  to one of the find* methods, it'll assume you want to use that
  object to search against a tag's CSS classes. Previously this only
  worked if you passed in a string.

* Fixed a bug that caused a crash when you passed a dictionary as an
  attribute value (possibly because you mistyped "attrs"). [bug=842419]

* Unicode, Dammit now detects the encoding in HTML 5-style <meta> tags
  like <meta charset="utf-8" />. [bug=837268]

* If Unicode, Dammit can't figure out a consistent encoding for a
  page, it will try each of its guesses again, with errors="replace"
  instead of errors="strict". This may mean that some data gets
  replaced with REPLACEMENT CHARACTER, but at least most of it will
  get turned into Unicode. [bug=754903]

* Patched over a bug in html5lib (?) that was crashing Beautiful Soup
  on certain kinds of markup. [bug=838800]

* Fixed a bug that wrecked the tree if you replaced an element with an
  empty string. [bug=728697]

* Improved Unicode, Dammit's behavior when you give it Unicode to
  begin with.

= 4.0.0b4 (20120208) =

* Added BeautifulSoup.new_string() to go along with BeautifulSoup.new_tag()

* BeautifulSoup.new_tag() will follow the rules of whatever
  tree-builder was used to create the original BeautifulSoup object. A
  new <p> tag will look like "<p />" if the soup object was created to
  parse XML, but it will look like "<p></p>" if the soup object was
  created to parse HTML.

* We pass in strict=False to html.parser on Python 3, greatly
  improving html.parser's ability to handle bad HTML.

* We also monkeypatch a serious bug in html.parser that made
  strict=False disastrous on Python 3.2.2.

* Replaced the "substitute_html_entities" argument with the
  more general "formatter" argument.

* Bare ampersands and angle brackets are always converted to XML
  entities unless the user prevents it.

* Added PageElement.insert_before() and PageElement.insert_after(),
  which let you put an element into the parse tree with respect to
  some other element.

* Raise an exception when the user tries to do something nonsensical
  like insert a tag into itself.


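The new tree-building helpers can be sketched together like this.
The markup is invented and the html.parser tree builder is assumed:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>stop</b>", "html.parser")

# new_tag() builds a tag that follows the soup's tree-builder rules.
tag = soup.new_tag("i")
tag.string = "Don't"

# insert_before() places the new tag ahead of the existing string.
soup.b.string.insert_before(tag)
after_insert_before = str(soup)

# new_string() plus insert_after() adds text right after the new tag.
soup.b.i.insert_after(soup.new_string(" ever "))
after_insert_after = str(soup)

print(after_insert_before)  # <b><i>Don't</i>stop</b>
print(after_insert_after)   # <b><i>Don't</i> ever stop</b>
```
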
= 4.0.0b3 (20120203) =

Beautiful Soup 4 is a nearly-complete rewrite that removes Beautiful
Soup's custom HTML parser in favor of a system that lets you write a
little glue code and plug in any HTML or XML parser you want.

Beautiful Soup 4.0 comes with glue code for four parsers:

 * Python's standard HTMLParser (html.parser in Python 3)
 * lxml's HTML and XML parsers
 * html5lib's HTML parser

HTMLParser is the default, but I recommend you install lxml if you
can.

For complete documentation, see the Sphinx documentation in
bs4/doc/source/. What follows is a summary of the changes from
Beautiful Soup 3.

=== The module name has changed ===

Previously you imported the BeautifulSoup class from a module also
called BeautifulSoup. To save keystrokes and make it clear which
version of the API is in use, the module is now called 'bs4':

    >>> from bs4 import BeautifulSoup

=== It works with Python 3 ===

Beautiful Soup 3.1.0 worked with Python 3, but the parser it used was
so bad that it barely worked at all. Beautiful Soup 4 works with
Python 3, and since its parser is pluggable, you don't sacrifice
quality.

Special thanks to Thomas Kluyver and Ezio Melotti for getting Python 3
support to the finish line. Ezio Melotti is also to thank for greatly
improving the HTML parser that comes with Python 3.2.

=== CDATA sections are normal text, if they're understood at all. ===

Currently, the lxml and html5lib HTML parsers ignore CDATA sections in
markup:

 <p><![CDATA[foo]]></p> => <p></p>

A future version of html5lib will turn CDATA sections into text nodes,
but only within tags like <svg> and <math>:

 <svg><![CDATA[foo]]></svg> => <svg>foo</svg>

The default XML parser (which uses lxml behind the scenes) turns CDATA
sections into ordinary text elements:

 <p><![CDATA[foo]]></p> => <p>foo</p>

In theory it's possible to preserve the CDATA sections when using the
XML parser, but I don't see how to get it to work in practice.

=== Miscellaneous other stuff ===

If the BeautifulSoup instance has .is_xml set to True, an appropriate
XML declaration will be emitted when the tree is transformed into a
string:

    <?xml version="1.0" encoding="utf-8"?>
    <markup>
     ...
    </markup>

The ['lxml', 'xml'] tree builder sets .is_xml to True; the other tree
builders set it to False. If you want to parse XHTML with an HTML
parser, you can set it manually.


= 3.2.0 =

The 3.1 series wasn't very useful, so I renamed the 3.0 series to 3.2
to make it obvious which one you should use.

560*4882a593Smuzhiyun= 3.1.0 =
561*4882a593Smuzhiyun
562*4882a593SmuzhiyunA hybrid version that supports 2.4 and can be automatically converted
563*4882a593Smuzhiyunto run under Python 3.0. There are three backwards-incompatible
564*4882a593Smuzhiyunchanges you should be aware of, but no new features or deliberate
565*4882a593Smuzhiyunbehavior changes.
566*4882a593Smuzhiyun
567*4882a593Smuzhiyun1. str() may no longer do what you want. This is because the meaning
568*4882a593Smuzhiyunof str() inverts between Python 2 and 3; in Python 2 it gives you a
569*4882a593Smuzhiyunbyte string, in Python 3 it gives you a Unicode string.
570*4882a593Smuzhiyun
571*4882a593SmuzhiyunThe effect of this is that you can't pass an encoding to .__str__
572*4882a593Smuzhiyunanymore. Use encode() to get a string and decode() to get Unicode, and
573*4882a593Smuzhiyunyou'll be ready (well, readier) for Python 3.
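
A minimal sketch of the encode()/decode() advice above, using plain
Python 3 strings rather than Beautiful Soup objects. The variable names
are only illustrative:

```python
# Python 3: str is Unicode text, bytes is encoded data.
text = "caf\u00e9"                  # a Unicode string

# Ask for bytes explicitly instead of relying on str().
encoded = text.encode("utf-8")

# Go back to Unicode explicitly, naming the encoding.
decoded = encoded.decode("utf-8")

print(type(encoded).__name__, encoded)
print(type(decoded).__name__, decoded)
```

Naming the encoding at both ends keeps the code independent of what
str() happens to mean on the running interpreter.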

2. Beautiful Soup is now based on HTMLParser rather than SGMLParser,
which is gone in Python 3. There's some bad HTML that SGMLParser
handled but HTMLParser doesn't, usually to do with attribute values
that aren't closed or have brackets inside them:

  <a href="foo</a>, </a><a href="bar">baz</a>
  <a b="<a>">', '<a b="&lt;a&gt;"></a><a>"></a>

A later version of Beautiful Soup will allow you to plug in different
parsers to make tradeoffs between speed and the ability to handle bad
HTML.

3. In Python 3 (but not Python 2), HTMLParser converts entities within
attributes to the corresponding Unicode characters. In Python 2 it's
possible to parse this string and leave the &eacute; intact:

 <a href="http://crummy.com?sacr&eacute;&bleu">

In Python 3, the &eacute; is always converted to \xe9 during
parsing.
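
The Python 3 behavior can be observed with the standard library's
html.parser module alone. This is a sketch, not Beautiful Soup code;
AttrCollector is a hypothetical helper written for this example:

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Hypothetical helper: records attributes as the parser reports them."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, already entity-decoded.
        self.seen.extend(attrs)


collector = AttrCollector()
collector.feed('<a href="http://crummy.com?sacr&eacute;&bleu">')
collector.close()

href = dict(collector.seen)["href"]
print(repr(href))
```

By the time handle_starttag() runs, the &eacute; has already been
replaced, so a tree builder sitting on top of HTMLParser never sees the
original entity.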


= 3.0.7a =

Added an import that makes BS work in Python 2.3.


= 3.0.7 =

Fixed a UnicodeDecodeError when unpickling documents that contain
non-ASCII characters.

Fixed a TypeError that occurred in some circumstances when a tag
contained no text.

Jumped through hoops to avoid the use of chardet, which can be
extremely slow in some circumstances. UTF-8 documents should never
trigger the use of chardet.

Whitespace is preserved inside <pre> and <textarea> tags that contain
nothing but whitespace.

Beautiful Soup can now parse a doctype that's scoped to an XML namespace.


= 3.0.6 =

Got rid of a very old debug line that prevented chardet from working.

Added a Tag.decompose() method that completely disconnects a tree or a
subset of a tree, breaking it up into bite-sized pieces that are
easy for the garbage collector to collect.

Tag.extract() now returns the tag that was extracted.

Tag.findNext() now does something with the keyword arguments you pass
it instead of dropping them on the floor.

Fixed a Unicode conversion bug.

Fixed a bug that garbled some <meta> tags when rewriting them.


= 3.0.5 =

Soup objects can now be pickled, and copied with copy.deepcopy.

Tag.append now works properly on existing BS objects. (It wasn't
originally intended for outside use, but it can be now.) (Giles
Radford)

Passing in a nonexistent encoding will no longer crash the parser on
Python 2.4 (John Nagle).

Fixed an underlying bug in SGMLParser that thinks ASCII has 255
characters instead of 127 (John Nagle).

Entities are converted more consistently to Unicode characters.

Entity references in attribute values are now converted to Unicode
characters when appropriate. Numeric entities are always converted,
because SGMLParser always converts them outside of attribute values.

ALL_ENTITIES happens to just be the XHTML entities, so I renamed it to
XHTML_ENTITIES.

The regular expression for bare ampersands was too loose. In some
cases ampersands were not being escaped. (Sam Ruby?)
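
To show what "bare ampersand" means here, a rough sketch follows. The
pattern is an assumption made up for this example, not the regular
expression Beautiful Soup actually uses: it treats an ampersand as bare
unless it starts a named or numeric reference ending in a semicolon.

```python
import re

# Assumed pattern (not Beautiful Soup's): '&' that does not begin
# something shaped like &name; or &#123; or &#x1F;
BARE_AMPERSAND = re.compile(r"&(?!#\d+;|#[xX][0-9a-fA-F]+;|\w+;)")


def escape_bare_ampersands(text):
    """Hypothetical helper: escape only the ampersands that stand alone."""
    return BARE_AMPERSAND.sub("&amp;", text)


sample = "AT&T &amp; R&D &#233; &eacute;"
escaped = escape_bare_ampersands(sample)
print(escaped)
```

A looser pattern that accepted anything after the ampersand as an
entity would leave "AT&T" unescaped, which is the kind of miss this
entry describes.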

Non-breaking spaces and other special Unicode space characters are no
longer folded to ASCII spaces. (Robert Leftwich)

Information inside a TEXTAREA tag is now parsed literally, not as HTML
tags. TEXTAREA now works exactly the same way as SCRIPT. (Zephyr Fang)

= 3.0.4 =

Fixed a bug that crashed Unicode conversion in some cases.

Fixed a bug that prevented UnicodeDammit from being used as a
general-purpose data scrubber.

Fixed some unit test failures when running against Python 2.5.

When considering whether to convert smart quotes, UnicodeDammit now
looks at the original encoding in a case-insensitive way.

= 3.0.3 (20060606) =

Beautiful Soup is now usable as a way to clean up invalid XML/HTML (be
sure to pass in an appropriate value for convertEntities, or XML/HTML
entities might stick around that aren't valid in HTML/XML). The result
may not validate, but it should be good enough to not choke a
real-world XML parser. Specifically, the output of a properly
constructed soup object should always be valid as part of an XML
document, but parts may be missing if they were missing in the
original. As always, if the input is valid XML, the output will also
be valid.

= 3.0.2 (20060602) =

Previously, Beautiful Soup correctly handled attribute values that
contained embedded quotes (sometimes by escaping), but not other kinds
of special XML characters. Now, it correctly handles or escapes all
special XML characters in attribute values.
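
For a feel of what escaping an attribute value involves, here is the
standard library's xml.sax.saxutils.quoteattr() at work. It is a
stand-in for illustration only; Beautiful Soup has its own escaping
code, which may pick quotes differently:

```python
from xml.sax.saxutils import quoteattr

# A value containing an ampersand, double quotes, and angle brackets.
value = 'Tom & "Jerry" <live>'

# quoteattr() escapes &, < and >, then picks a quote character that
# does not clash with the quotes inside the value.
quoted = quoteattr(value)
print("<a title=%s>" % quoted)
```

Because the value contains double quotes but no single quotes, the
result is wrapped in single quotes and the embedded double quotes are
left alone.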

I aliased methods to the 2.x names (fetch, find, findText, etc.) for
backwards compatibility purposes. Those names are deprecated and if I
ever do a 4.0 I will remove them. I will, I tell you!

Fixed a bug where the findAll method wasn't passing along any keyword
arguments.

When run from the command line, Beautiful Soup now acts as an HTML
pretty-printer, not an XML pretty-printer.

= 3.0.1 (20060530) =

Reintroduced the "fetch by CSS class" shortcut. I thought keyword
arguments would replace it, but they don't. You can't call soup('a',
class='foo') because class is a Python keyword.
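
The keyword clash is a property of Python itself and can be checked
without Beautiful Soup. In this sketch, search() is a hypothetical
stand-in for any method that accepts keyword arguments:

```python
import keyword


def search(name, **kwargs):
    """Hypothetical stand-in for a keyword-accepting search method."""
    return name, kwargs


# 'class' is reserved, so it cannot be written as a keyword argument.
print(keyword.iskeyword("class"))

try:
    compile("search('a', class='foo')", "<example>", "eval")
    compiled = True
except SyntaxError:
    compiled = False

# Unpacking a dictionary avoids the syntax problem, at some cost in
# readability.
result = search("a", **{"class": "foo"})
print(compiled, result)
```

The call is rejected when the source is compiled, before any library
code runs, which is why a positional shortcut is the friendlier
spelling.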

If Beautiful Soup encounters a meta tag that declares the encoding,
but a SoupStrainer tells it not to parse that tag, Beautiful Soup will
no longer try to rewrite the meta tag to mention the new
encoding. Basically, this makes SoupStrainers work in real-world
applications instead of crashing the parser.

= 3.0.0 "Who would not give all else for two p" (20060528) =

This release is not backward-compatible with previous releases. If
you've got code written with a previous version of the library, go
ahead and keep using it, unless one of the features mentioned here
really makes your life easier. Since the library is self-contained,
you can include an old copy of the library in your old applications,
and use the new version for everything else.

The documentation has been rewritten and greatly expanded with many
more examples.

Beautiful Soup autodetects the encoding of a document (or uses the one
you specify), and converts it from its native encoding to
Unicode. Internally, it only deals with Unicode strings. When you
print out the document, it converts to UTF-8 (or another encoding you
specify). [Doc reference]

It's now easy to make large-scale changes to the parse tree without
screwing up the navigation members. The methods are extract,
replaceWith, and insert. [Doc reference. See also Improving Memory
Usage with extract]

Passing True in as an attribute value gives you tags that have any
value for that attribute. You don't have to create a regular
expression. Passing None for an attribute value gives you tags that
don't have that attribute at all.
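
The True/None rule can be sketched over plain dictionaries. The helper
attr_matches() and the sample data are invented for this example and
are not part of the library:

```python
def attr_matches(attrs, name, wanted):
    """Hypothetical helper; attrs is a dict of one tag's attributes."""
    if wanted is True:
        return name in attrs          # any value will do
    if wanted is None:
        return name not in attrs      # the attribute must be absent
    return attrs.get(name) == wanted  # otherwise compare the value


tags = [
    {"href": "a.html"},
    {"name": "top"},
    {"href": "b.html", "id": "5"},
]

with_href = [t for t in tags if attr_matches(t, "href", True)]
without_href = [t for t in tags if attr_matches(t, "href", None)]
print(len(with_href), len(without_href))
```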

Tag objects now know whether or not they're self-closing. This avoids
the problem where Beautiful Soup thought that tags like <BR /> were
self-closing even in XML documents. You can customize the self-closing
tags for a parser object by passing them in as a list of
selfClosingTags: you don't have to subclass anymore.

There's a new built-in parser, MinimalSoup, which has most of
BeautifulSoup's HTML-specific rules, but no tag nesting rules. [Doc
reference]

You can use a SoupStrainer to tell Beautiful Soup to parse only part
of a document. This saves time and memory, often making Beautiful Soup
about as fast as a custom-built SGMLParser subclass. [Doc reference,
SoupStrainer reference]

You can (usually) use keyword arguments instead of passing a
dictionary of attributes to a search method. That is, you can replace
soup(attrs={"id" : "5"}) with soup(id="5"). You can still use attrs if
(for instance) you need to find an attribute whose name clashes with
the name of an argument to findAll. [Doc reference: **kwargs attrs]

The method names have changed to the better method names used in
Rubyful Soup. Instead of find methods and fetch methods, there are
only find methods. Instead of a scheme where you can't remember which
method finds one element and which one finds them all, we have find
and findAll. In general, if the method name mentions All or a plural
noun (e.g. findNextSiblings), then it finds many elements. Otherwise,
it only finds one element. [Doc reference]

Some of the argument names have been renamed for clarity. For
instance, avoidParserProblems is now parserMassage.

Beautiful Soup no longer implements a feed method. You need to pass a
string or a filehandle into the soup constructor, rather than calling
feed after the soup has been created. There is still a feed method,
but it's the feed method implemented by SGMLParser and calling it will
bypass Beautiful Soup and cause problems.

The NavigableText class has been renamed to NavigableString. There is
no NavigableUnicodeString anymore, because every string inside a
Beautiful Soup parse tree is a Unicode string.

findText and fetchText are gone. Just pass a text argument into find
or findAll.

Null was more trouble than it was worth, so I got rid of it. Anything
that used to return Null now returns None.

Special XML constructs like comments and CDATA now have their own
NavigableString subclasses, instead of being treated as oddly-formed
data. If you parse a document that contains CDATA and write it back
out, the CDATA will still be there.

When you're parsing a document, you can get Beautiful Soup to convert
XML or HTML entities into the corresponding Unicode characters. [Doc
reference]

= 2.1.1 (20050918) =

Fixed a serious performance bug in BeautifulStoneSoup which was
causing parsing to be incredibly slow.

Corrected several entities that were previously being incorrectly
translated from Microsoft smart-quote-like characters.

Fixed a bug that was breaking text fetch.

Fixed a bug that crashed the parser when text chunks that look like
HTML tag names showed up within a SCRIPT tag.

THEAD, TBODY, and TFOOT tags are now nestable within TABLE
tags. Nested tables should parse more sensibly now.

BASE is now considered a self-closing tag.

= 2.1.0 "Game, or any other dish?" (20050504) =

Added a wide variety of new search methods which, given a starting
point inside the tree, follow a particular navigation member (like
nextSibling) over and over again, looking for Tag and NavigableText
objects that match certain criteria. The new methods are findNext,
fetchNext, findPrevious, fetchPrevious, findNextSibling,
fetchNextSiblings, findPreviousSibling, fetchPreviousSiblings,
findParent, and fetchParents. All of these use the same basic code
used by first and fetch, so you can pass your weird ways of matching
things into these methods.

The fetch method and its derivatives now accept a limit argument.

You can now pass keyword arguments when calling a Tag object as though
it were a method.

Fixed a bug that caused all hand-created tags to share a single set of
attributes.

= 2.0.3 (20050501) =

Fixed Python 2.2 support for iterators.

Fixed a bug that gave the wrong representation to tags within quote
tags like <script>.

Took some code from Mark Pilgrim that treats CDATA declarations as
data instead of ignoring them.

Beautiful Soup's setup.py will now do an install even if the unit
tests fail. It won't build a source distribution if the unit tests
fail, so I can't release a new version unless they pass.

= 2.0.2 (20050416) =

Added the unit tests in a separate module, and packaged it with
distutils.

Fixed a bug that sometimes caused renderContents() to return a Unicode
string even if there was no Unicode in the original string.

Added the done() method, which closes all of the parser's open
tags. It gets called automatically when you pass in some text to the
constructor of a parser class; otherwise you must call it yourself.

Reinstated some backwards compatibility with 1.x versions: referencing
the string member of a NavigableText object returns the NavigableText
object instead of throwing an error.

= 2.0.1 (20050412) =

Fixed a bug that caused bad results when you tried to reference a tag
name shorter than 3 characters as a member of a Tag, e.g. tag.table.td.

Made sure all Tags have the 'hidden' attribute so that an attempt to
access tag.hidden doesn't spawn an attempt to find a tag named
'hidden'.

Fixed a bug in the comparison operator.

= 2.0.0 "Who cares for fish?" (20050410) =

Beautiful Soup version 1 was very useful but also pretty stupid. I
originally wrote it without noticing any of the problems inherent in
trying to build a parse tree out of ambiguous HTML tags. This version
solves all of those problems to my satisfaction. It also adds many new
clever things to make up for the removal of the stupid things.

== Parsing ==

The parser logic has been greatly improved, and the BeautifulSoup
class should much more reliably yield a parse tree that looks like
what the page author intended. For a particular class of odd edge
cases that now causes problems, there is a new class,
ICantBelieveItsBeautifulSoup.

By default, Beautiful Soup now performs some cleanup operations on
text before parsing it. This is to avoid common problems with bad
definitions and self-closing tags that crash SGMLParser. You can
provide your own set of cleanup operations, or turn it off
altogether. The cleanup operations include fixing self-closing tags
that don't close, and replacing Microsoft smart quotes and similar
characters with their HTML entity equivalents.

You can now get a pretty-print version of parsed HTML to get a visual
picture of how Beautiful Soup parses it, with the Tag.prettify()
method.

== Strings and Unicode ==

There are separate NavigableText subclasses for ASCII and Unicode
strings. These classes directly subclass the corresponding base data
types. This means you can treat NavigableText objects as strings
instead of having to call methods on them to get the strings.

str() on a Tag always returns a string, and unicode() always returns
Unicode. Previously it was inconsistent.

== Tree traversal ==

In a first() or fetch() call, the tag name or the desired value of an
attribute can now be any of the following:

 * A string (matches that specific tag or that specific attribute value)
 * A list of strings (matches any tag or attribute value in the list)
 * A compiled regular expression object (matches any tag or attribute
   value that matches the regular expression)
 * A callable object that takes the Tag object or attribute value as a
   string. It returns None/false/empty string if the given string
   doesn't match, and any other value if it does.

This is much easier to use than SQL-style wildcards (see, regular
expressions are good for something). Because of this, I took out
SQL-style wildcards. I'll put them back if someone complains, but
their removal simplifies the code a lot.
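
The four kinds of criteria listed above can be sketched as one small
dispatch function. matches() is a hypothetical helper written for this
example, and using search() for regular expressions is an assumption
rather than a statement about the library's internals:

```python
import re


def matches(criterion, value):
    """Hypothetical helper: does value satisfy criterion?"""
    if callable(criterion):
        # None, False and '' count as "no match"; anything else matches.
        return bool(criterion(value))
    if hasattr(criterion, "search"):
        # Compiled regular expression object.
        return criterion.search(value) is not None
    if isinstance(criterion, (list, tuple)):
        return value in criterion
    return criterion == value


print(matches("td", "td"))
print(matches(["td", "th"], "th"))
print(matches(re.compile("^t[dh]$"), "th"))
print(matches(lambda name: name.startswith("t"), "table"))
```

The callable check comes first on purpose: a compiled pattern is not
callable, so the order keeps the two cases from being confused.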

You can use fetch() and first() to search for text in the parse tree,
not just tags. There are new alias methods fetchText() and firstText()
designed for this purpose. As with searching for tags, you can pass in
a string, a regular expression object, or a method to match your text.

If you pass in something besides a map to the attrs argument of
fetch() or first(), Beautiful Soup will assume you want to match that
thing against the "class" attribute. When you're scraping
well-structured HTML, this makes your code a lot cleaner.

1.x and 2.x both let you call a Tag object as a shorthand for
fetch(). For instance, foo("bar") is a shorthand for
foo.fetch("bar"). In 2.x, you can also access a specially-named member
of a Tag object as a shorthand for first(). For instance, foo.barTag
is a shorthand for foo.first("bar"). By chaining these shortcuts you
can traverse a tree in very little code:

  for header in soup.bodyTag.pTag.tableTag('th'):

If an element relationship (like parent or next) doesn't apply to a
tag, it'll now show up Null instead of None. first() will also return
Null if you ask it for a nonexistent tag. Null is an object that's
just like None, except you can do whatever you want to it and it'll
give you Null instead of throwing an error.

This lets you do tree traversals like soup.htmlTag.headTag.titleTag
without having to worry if the intermediate stages are actually
there. Previously, if there was no 'head' tag in the document, headTag
in that instance would have been None, and accessing its 'titleTag'
member would have thrown an AttributeError. Now, you can get what you
want when it exists, and get Null when it doesn't, without having to
do a lot of conditionals checking to see if every stage is None.
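
The idea behind Null is the null-object pattern. The sketch below is
not the 2.0 implementation; NullType and Node are hypothetical classes
that only show why a chained lookup stops raising AttributeError:

```python
class NullType:
    """Hypothetical falsy object that absorbs whatever is done to it."""

    def __getattr__(self, name):
        return self

    def __call__(self, *args, **kwargs):
        return self

    def __getitem__(self, key):
        return self

    def __bool__(self):
        return False

    def __repr__(self):
        return "Null"


Null = NullType()


class Node:
    """Hypothetical tree node whose missing children come back as Null."""

    def __init__(self, **children):
        self.__dict__.update(children)

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails.
        return Null


# A document with <html> and <body> but no <head>.
soup = Node(htmlTag=Node(bodyTag=Node()))
title = soup.htmlTag.headTag.titleTag
print(title, bool(title))
```

The convenience has a cost: a misspelled member name also comes back
as Null instead of failing loudly.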

There are two new relations between page elements: previousSibling and
nextSibling. They reference the previous and next element at the same
level of the parse tree. For instance, if you have HTML like this:

  <p><ul><li>Foo<br /><li>Bar</ul>

The first 'li' tag has a previousSibling of Null and its nextSibling
is the second 'li' tag. The second 'li' tag has a nextSibling of Null
and its previousSibling is the first 'li' tag. The previousSibling of
the 'ul' tag is the first 'p' tag. The nextSibling of 'Foo' is the
'br' tag.

I took out the ability to use fetch() to find tags that have a
specific list of contents. See, I can't even explain it well. It was
really difficult to use, I never used it, and I don't think anyone
else ever used it. To the extent anyone did, they can probably use
fetchText() instead. If it turns out someone needs it I'll think of
another solution.

== Tree manipulation ==

You can add new attributes to a tag, and delete attributes from a
tag. In 1.x you could only change a tag's existing attributes.

== Porting Considerations ==

There are three changes in 2.0 that break old code:

In the post-1.2 release you could pass a function into fetch(). The
function took a string, the tag name. In 2.0, the function takes the
actual Tag object.

It's no longer possible to pass in SQL-style wildcards to fetch(). Use
a regular expression instead.
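
When porting, a wildcard string can be translated mechanically. This
sketch assumes '%' is the only wildcard and that it means "any run of
characters", as in SQL LIKE; wildcard_to_regex() is a hypothetical
helper, not something the library provides:

```python
import re


def wildcard_to_regex(pattern):
    """Hypothetical helper: turn '%' wildcards into a compiled regex."""
    # Escape the literal pieces so characters like '.' stay literal.
    parts = [re.escape(part) for part in pattern.split("%")]
    return re.compile(".*".join(parts), re.DOTALL)


ends_with_foo = wildcard_to_regex("%foo")

# fullmatch() makes the pattern cover the whole value, as LIKE does.
for candidate in ("barfoo", "foo", "foobar", "fo"):
    print(candidate, bool(ends_with_foo.fullmatch(candidate)))
```

The resulting object can then be passed wherever a compiled regular
expression is accepted.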

The different parsing algorithm means the parse tree may not be shaped
like you expect. This will only actually affect you if your code uses
one of the affected parts. I haven't run into this problem yet while
porting my code.

= Between 1.2 and 2.0 =

This is the release to get if you want Python 1.5 compatibility.

The desired value of an attribute can now be any of the following:

 * A string
 * A string with SQL-style wildcards
 * A compiled RE object
 * A callable that returns None/false/empty string if the given value
   doesn't match, and any other value otherwise.

This is much easier to use than SQL-style wildcards (see, regular
expressions are good for something). Because of this, I no longer
recommend you use SQL-style wildcards. They may go away in a future
release to clean up the code.

Made Beautiful Soup handle processing instructions as text instead of
ignoring them.

Applied patch from Richie Hindle (richie at entrian dot com) that
makes tag.string a shorthand for tag.contents[0].string when the tag
has only one string-owning child.

Added still more nestable tags. The nestable tags thing won't work in
a lot of cases and needs to be rethought.

Fixed an edge case where searching for "%foo" would match any string
shorter than "foo".

= 1.2 "Who for such dainties would not stoop?" (20040708) =

Applied patch from Ben Last (ben at benlast dot com) that made
Tag.renderContents() correctly handle Unicode.

Made BeautifulStoneSoup even dumber by making it not implicitly close
a tag when another tag of the same type is encountered; only when an
actual closing tag is encountered. This change courtesy of Fuzzy (mike
at pcblokes dot com). BeautifulSoup still works as before.

= 1.1 "Swimming in a hot tureen" =

Added more 'nestable' tags. Changed popping semantics so that when a
nestable tag is encountered, tags are popped up to the previously
encountered nestable tag (of whatever kind). I will revert this if
enough people complain, but it should make more people's lives easier
than harder. This enhancement was suggested by Anthony Baxter (anthony
at interlink dot com dot au).

= 1.0 "So rich and green" (20040420) =

Initial release.