= 4.3.2 (20131002) =

* Fixed a bug in which short Unicode input was improperly encoded to
  ASCII when checking whether or not it was the name of a file on
  disk. [bug=1227016]

* Fixed a crash when a short input contains data not valid in
  filenames. [bug=1232604]

* Fixed a bug that caused Unicode data put into UnicodeDammit to
  return None instead of the original data. [bug=1214983]

* Combined two tests to stop a spurious test failure when tests are
  run by nosetests. [bug=1212445]

= 4.3.1 (20130815) =

* Fixed yet another problem with the html5lib tree builder, caused by
  html5lib's tendency to rearrange the tree during
  parsing. [bug=1189267]

* Fixed a bug that caused the optimized version of find_all() to
  return nothing. [bug=1212655]

= 4.3.0 (20130812) =

* Instead of converting incoming data to Unicode and feeding it to the
  lxml tree builder in chunks, Beautiful Soup now makes successive
  guesses at the encoding of the incoming data, and tells lxml to
  parse the data as that encoding. Giving lxml more control over the
  parsing process improves performance and avoids a number of bugs and
  issues with the lxml parser which had previously required elaborate
  workarounds:

  - An issue in which lxml refuses to parse Unicode strings on some
    systems.
    [bug=1180527]

  - A returning bug that truncated documents longer than a (very
    small) size. [bug=963880]

  - A returning bug in which extra spaces were added to a document if
    the document defined a charset other than UTF-8. [bug=972466]

  This required a major overhaul of the tree builder architecture. If
  you wrote your own tree builder and didn't tell me, you'll need to
  modify your prepare_markup() method.

* The UnicodeDammit code that makes guesses at encodings has been
  split into its own class, EncodingDetector. A lot of apparently
  redundant code has been removed from Unicode, Dammit, and some
  undocumented features have also been removed.

* Beautiful Soup will issue a warning if instead of markup you pass it
  a URL or the name of a file on disk (a common beginner's mistake).

* A number of optimizations improve the performance of the lxml tree
  builder by about 33%, the html.parser tree builder by about 20%, and
  the html5lib tree builder by about 15%.

* All find_all calls should now return a ResultSet object. Patch by
  Aaron DeVore. [bug=1194034]

= 4.2.1 (20130531) =

* The default XML formatter will now replace ampersands even if they
  appear to be part of entities. That is, "&lt;" will become
  "&amp;lt;". The old code was left over from Beautiful Soup 3, which
  didn't always turn entities into Unicode characters.

  If you really want the old behavior (maybe because you add new
  strings to the tree, those strings include entities, and you want
  the formatter to leave them alone on output), it can be found in
  EntitySubstitution.substitute_xml_containing_entities(). [bug=1182183]

* Gave new_string() the ability to create subclasses of
  NavigableString. [bug=1181986]

* Fixed another bug by which the html5lib tree builder could create a
  disconnected tree. [bug=1182089]

* The .previous_element of a BeautifulSoup object is now always None,
  not the last element to be parsed. [bug=1182089]

* Fixed test failures when lxml is not installed. [bug=1181589]

* html5lib now supports Python 3. Fixed some Python 2-specific
  code in the html5lib test suite. [bug=1181624]

* The html.parser treebuilder can now handle numeric attributes in
  text when the hexadecimal name of the attribute starts with a
  capital X. Patch by Tim Shirley. [bug=1186242]

= 4.2.0 (20130514) =

* The Tag.select() method now supports a much wider variety of CSS
  selectors.

  - Added support for the adjacent sibling combinator (+) and the
    general sibling combinator (~). Tests by "liquider". [bug=1082144]

  - The combinators (>, +, and ~) can now combine with any supported
    selector, not just one that selects based on tag name.

  - Added limited support for the "nth-of-type" pseudo-class. Code
    by Sven Slootweg. [bug=1109952]

* The BeautifulSoup class is now aliased to "_s" and "_soup", making
  it quicker to type the import statement in an interactive session:

    from bs4 import _s
   or
    from bs4 import _soup

  The alias may change in the future, so don't use this in code you're
  going to run more than once.

* Added the 'diagnose' submodule, which includes several useful
  functions for reporting problems and doing tech support.

  - diagnose(data) tries the given markup on every installed parser,
    reporting exceptions and displaying successes. If a parser is not
    installed, diagnose() mentions this fact.

  - lxml_trace(data, html=True) runs the given markup through lxml's
    XML parser or HTML parser, and prints out the parser events as
    they happen. This helps you quickly determine whether a given
    problem occurs in lxml code or Beautiful Soup code.

  - htmlparser_trace(data) is the same thing, but for Python's
    built-in HTMLParser class.

* In an HTML document, the contents of a <script> or <style> tag will
  no longer undergo entity substitution by default. XML documents work
  the same way they did before. [bug=1085953]

* Methods like get_text() and properties like .strings now only give
  you strings that are visible in the document--no comments or
  processing instructions. [bug=1050164]

* The prettify() method now leaves the contents of <pre> tags
  alone. [bug=1095654]

* Fix a bug in the html5lib treebuilder which sometimes created
  disconnected trees. [bug=1039527]

* Fix a bug in the lxml treebuilder which crashed when a tag included
  an attribute from the predefined "xml:" namespace. [bug=1065617]

* Fix a bug by which keyword arguments to find_parent() were not
  being passed on. [bug=1126734]

* Stop a crash when unwisely messing with a tag that's been
  decomposed. [bug=1097699]

* Now that lxml's segfault on invalid doctype has been fixed, fixed a
  corresponding problem on the Beautiful Soup end that was previously
  invisible. [bug=984936]

* Fixed an exception when an overspecified CSS selector didn't match
  anything. Code by Stefaan Lippens. [bug=1168167]

= 4.1.3 (20120820) =

* Skipped a test under Python 2.6 and Python 3.1 to avoid a spurious
  test failure caused by the lousy HTMLParser in those
  versions. [bug=1038503]

* Raise a more specific error (FeatureNotFound) when a requested
  parser or parser feature is not installed. Raise NotImplementedError
  instead of ValueError when the user calls insert_before() or
  insert_after() on the BeautifulSoup object itself. Patch by Aaron
  DeVore. [bug=1038301]

= 4.1.2 (20120817) =

* As per PEP-8, allow searching by CSS class using the 'class_'
  keyword argument. [bug=1037624]

* Display namespace prefixes for namespaced attribute names, instead of
  the fully-qualified names given by the lxml parser. [bug=1037597]

* Fixed a crash on encoding when an attribute name contained
  non-ASCII characters.

* When sniffing encodings, if the cchardet library is installed,
  Beautiful Soup uses it instead of chardet. cchardet is much
  faster. [bug=1020748]

* Use logging.warning() instead of warnings.warn() to notify the user
  that characters were replaced with REPLACEMENT
  CHARACTER. [bug=1013862]

= 4.1.1 (20120703) =

* Fixed an html5lib tree builder crash which happened when html5lib
  moved a tag with a multivalued attribute from one part of the tree
  to another. [bug=1019603]

* Correctly display closing tags with an XML namespace declared. Patch
  by Andreas Kostyrka. [bug=1019635]

* Fixed a typo that made parsing significantly slower than it should
  have been, and also waited too long to close tags with XML
  namespaces. [bug=1020268]

* get_text() now returns an empty Unicode string if there is no text,
  rather than an empty bytestring. [bug=1020387]

= 4.1.0 (20120529) =

* Added experimental support for fixing Windows-1252 characters
  embedded in UTF-8 documents. (UnicodeDammit.detwingle())

* Fixed the handling of &quot; with the built-in parser. [bug=993871]

* Comments, processing instructions, document type declarations, and
  markup declarations are now treated as preformatted strings, the way
  CData blocks are. [bug=1001025]

* Fixed a bug with the lxml treebuilder that prevented the user from
  adding attributes to a tag that didn't originally have
  attributes. [bug=1002378] Thanks to Oliver Beattie for the patch.

* Fixed some edge-case bugs having to do with inserting an element
  into a tag it's already inside, and replacing one of a tag's
  children with another. [bug=997529]

* Added the ability to search for attribute values specified in UTF-8. [bug=1003974]

  This caused a major refactoring of the search code. All the tests
  pass, but it's possible that some searches will behave differently.

= 4.0.5 (20120427) =

* Added a new method, wrap(), which wraps an element in a tag.
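
  As a rough illustration of wrap(), the sketch below wraps a string in
  a new <b> tag. The markup and the choice of the html.parser tree
  builder are assumptions made for the example; they are not part of
  this release note.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, chosen only for illustration.
soup = BeautifulSoup("<p>I wish I was bold.</p>", "html.parser")

# wrap() places the element inside the given tag and returns that wrapper.
wrapper = soup.p.string.wrap(soup.new_tag("b"))

print(wrapper)  # the new <b> tag, now holding the string
print(soup.p)   # the <p> tag, now containing the <b> wrapper
```

  The wrapper tag is created with new_tag() so that it belongs to the
  same soup object as the element being wrapped.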

* Renamed replace_with_children() to unwrap(), which is easier to
  understand and also the jQuery name of the function.

* Made encoding substitution in <meta> tags completely transparent (no
  more %SOUP-ENCODING%).

* Fixed a bug in decoding data that contained a byte-order mark, such
  as data encoded in UTF-16LE. [bug=988980]

* Fixed a bug that made the HTMLParser treebuilder generate XML
  definitions ending with two question marks instead of
  one. [bug=984258]

* Upon document generation, CData objects are no longer run through
  the formatter. [bug=988905]

* The test suite now passes when lxml is not installed, whether or not
  html5lib is installed. [bug=987004]

* Print a warning on HTMLParseErrors to let people know they should
  install a better parser library.

= 4.0.4 (20120416) =

* Fixed a bug that sometimes created disconnected trees.

* Fixed a bug with the string setter that moved a string around the
  tree instead of copying it. [bug=983050]

* Attribute values are now run through the provided output formatter.
  Previously they were always run through the 'minimal' formatter. In
  the future I may make it possible to specify different formatters
  for attribute values and strings, but for now, consistent behavior
  is better than inconsistent behavior. [bug=980237]

* Added the missing renderContents method from Beautiful Soup 3. Also
  added an encode_contents() method to go along with decode_contents().

* Give a more useful error when the user tries to run the Python 2
  version of BS under Python 3.

* UnicodeDammit can now convert Microsoft smart quotes to ASCII with
  UnicodeDammit(markup, smart_quotes_to="ascii").

= 4.0.3 (20120403) =

* Fixed a typo that caused some versions of Python 3 to convert the
  Beautiful Soup codebase incorrectly.

* Got rid of the 4.0.2 workaround for HTML documents--it was
  unnecessary and the workaround was triggering a (possibly different,
  but related) bug in lxml. [bug=972466]

= 4.0.2 (20120326) =

* Worked around a possible bug in lxml that prevents non-tiny XML
  documents from being parsed. [bug=963880, bug=963936]

* Fixed a bug where specifying `text` while also searching for a tag
  only worked if `text` wanted an exact string match. [bug=955942]

= 4.0.1 (20120314) =

* This is the first official release of Beautiful Soup 4. There is no
  4.0.0 release, to eliminate any possibility that packaging software
  might treat "4.0.0" as being an earlier version than "4.0.0b10".

* Brought BS up to date with the latest release of soupselect, adding
  CSS selector support for direct descendant matches and multiple CSS
  class matches.

= 4.0.0b10 (20120302) =

* Added support for simple CSS selectors, taken from the soupselect project.

* Fixed a crash when using html5lib. [bug=943246]

* In HTML5-style <meta charset="foo"> tags, the value of the "charset"
  attribute is now replaced with the appropriate encoding on
  output. [bug=942714]

* Fixed a bug that caused calling a tag to sometimes call find_all()
  with the wrong arguments. [bug=944426]

* For backwards compatibility, brought back the BeautifulStoneSoup
  class as a deprecated wrapper around BeautifulSoup.

= 4.0.0b9 (20120228) =

* Fixed the string representation of DOCTYPEs that have both a public
  ID and a system ID.

* Fixed the generated XML declaration.

* Renamed Tag.nsprefix to Tag.prefix, for consistency with
  NamespacedAttribute.

* Fixed a test failure that occurred on Python 3.x when chardet was
  installed.

* Made prettify() return Unicode by default, so it will look nice on
  Python 3 when passed into print().

= 4.0.0b8 (20120224) =

* All tree builders now preserve namespace information in the
  documents they parse. If you use the html5lib parser or lxml's XML
  parser, you can access the namespace URL for a tag as tag.namespace.

  However, there is no special support for namespace-oriented
  searching or tree manipulation. When you search the tree, you need
  to use namespace prefixes exactly as they're used in the original
  document.

* The string representation of a DOCTYPE always ends in a newline.

* Issue a warning if the user tries to use a SoupStrainer in
  conjunction with the html5lib tree builder, which doesn't support
  them.

= 4.0.0b7 (20120223) =

* Upon decoding to string, any characters that can't be represented in
  your chosen encoding will be converted into numeric XML entity
  references.

* Issue a warning if characters were replaced with REPLACEMENT
  CHARACTER during Unicode conversion.

* Restored compatibility with Python 2.6.

* The install process no longer installs docs or auxiliary text files.

* It's now possible to deepcopy a BeautifulSoup object created with
  Python's built-in HTML parser.

* About 100 unit tests that "test" the behavior of various parsers on
  invalid markup have been removed. Legitimate changes to those
  parsers caused these tests to fail, indicating that perhaps
  Beautiful Soup should not test the behavior of foreign
  libraries.

  The problematic unit tests have been reformulated as informational
  comparisons generated by the script
  scripts/demonstrate_parser_differences.py.

  This makes Beautiful Soup compatible with html5lib version 0.95 and
  future versions of HTMLParser.

= 4.0.0b6 (20120216) =

* Multi-valued attributes like "class" always have a list of values,
  even if there's only one value in the list.

* Added a number of multi-valued attributes defined in HTML5.

* Stopped generating a space before the slash that closes an
  empty-element tag. This may come back if I add a special XHTML mode
  (http://www.w3.org/TR/xhtml1/#C_2), but right now it's pretty
  useless.

* Passing text along with tag-specific arguments to a find* method:

   find("a", text="Click here")

  will find tags that contain the given text as their
  .string. Previously, the tag-specific arguments were ignored and
  only strings were searched.
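
  The combined tag-and-text search described above might look like the
  sketch below. The markup is hypothetical, and the example assumes a
  Beautiful Soup release recent enough to accept string=, which is the
  later spelling of the text= argument shown in this entry.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
markup = '<a href="/a">Click here</a><a href="/b">Other</a><b>Click here</b>'
soup = BeautifulSoup(markup, "html.parser")

# Both restrictions apply: the tag name must match AND the tag's
# .string must equal the given text.
link = soup.find("a", string="Click here")
bold = soup.find("b", string="Click here")
missing = soup.find("a", string="No such text")

print(link, bold, missing)
```

  The result is a Tag rather than a bare string, because a tag name was
  given along with the text.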

* Fixed a bug that caused the html5lib tree builder to build a
  partially disconnected tree. Generally cleaned up the html5lib tree
  builder.

* If you restrict a multi-valued attribute like "class" to a string
  that contains spaces, Beautiful Soup will only consider it a match
  if the values correspond to that specific string.

= 4.0.0b5 (20120209) =

* Rationalized Beautiful Soup's treatment of CSS class. A tag
  belonging to multiple CSS classes is treated as having a list of
  values for the 'class' attribute. Searching for a CSS class will
  match *any* of the CSS classes.

  This actually affects all attributes that the HTML standard defines
  as taking multiple values (class, rel, rev, archive, accept-charset,
  and headers), but 'class' is by far the most common. [bug=41034]

* If you pass anything other than a dictionary as the second argument
  to one of the find* methods, it'll assume you want to use that
  object to search against a tag's CSS classes. Previously this only
  worked if you passed in a string.

* Fixed a bug that caused a crash when you passed a dictionary as an
  attribute value (possibly because you mistyped "attrs"). [bug=842419]

* Unicode, Dammit now detects the encoding in HTML 5-style <meta> tags
  like <meta charset="utf-8" />. [bug=837268]

* If Unicode, Dammit can't figure out a consistent encoding for a
  page, it will try each of its guesses again, with errors="replace"
  instead of errors="strict". This may mean that some data gets
  replaced with REPLACEMENT CHARACTER, but at least most of it will
  get turned into Unicode. [bug=754903]

* Patched over a bug in html5lib (?) that was crashing Beautiful Soup
  on certain kinds of markup. [bug=838800]

* Fixed a bug that wrecked the tree if you replaced an element with an
  empty string. [bug=728697]

* Improved Unicode, Dammit's behavior when you give it Unicode to
  begin with.

= 4.0.0b4 (20120208) =

* Added BeautifulSoup.new_string() to go along with BeautifulSoup.new_tag()

* BeautifulSoup.new_tag() will follow the rules of whatever
  tree-builder was used to create the original BeautifulSoup object. A
  new <p> tag will look like "<p />" if the soup object was created to
  parse XML, but it will look like "<p></p>" if the soup object was
  created to parse HTML.

* We pass in strict=False to html.parser on Python 3, greatly
  improving html.parser's ability to handle bad HTML.

* We also monkeypatch a serious bug in html.parser that made
  strict=False disastrous on Python 3.2.2.

* Replaced the "substitute_html_entities" argument with the
  more general "formatter" argument.

* Bare ampersands and angle brackets are always converted to XML
  entities unless the user prevents it.

* Added PageElement.insert_before() and PageElement.insert_after(),
  which let you put an element into the parse tree with respect to
  some other element.

* Raise an exception when the user tries to do something nonsensical
  like insert a tag into itself.


= 4.0.0b3 (20120203) =

Beautiful Soup 4 is a nearly-complete rewrite that removes Beautiful
Soup's custom HTML parser in favor of a system that lets you write a
little glue code and plug in any HTML or XML parser you want.

Beautiful Soup 4.0 comes with glue code for four parsers:

 * Python's standard HTMLParser (html.parser in Python 3)
 * lxml's HTML and XML parsers
 * html5lib's HTML parser

HTMLParser is the default, but I recommend you install lxml if you
can.

For complete documentation, see the Sphinx documentation in
bs4/doc/source/. What follows is a summary of the changes from
Beautiful Soup 3.
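
Choosing among the parsers listed above could be sketched as follows.
This is only an illustration: the markup is hypothetical, and passing
the parser name as the second constructor argument comes from the
Beautiful Soup documentation rather than from this changelog.

```python
from bs4 import BeautifulSoup

markup = "<p>Hello, <b>world</b></p>"  # hypothetical input

# Name the tree builder explicitly. "html.parser" ships with Python,
# so this line needs no extra installation.
soup = BeautifulSoup(markup, "html.parser")
text = soup.get_text()
print(text)

# "lxml" or "html5lib" can be named the same way, but only once the
# corresponding package is installed.
```

Naming the parser explicitly keeps the result from depending on which
optional parsers happen to be installed on a given machine.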

=== The module name has changed ===

Previously you imported the BeautifulSoup class from a module also
called BeautifulSoup. To save keystrokes and make it clear which
version of the API is in use, the module is now called 'bs4':

 >>> from bs4 import BeautifulSoup

=== It works with Python 3 ===

Beautiful Soup 3.1.0 worked with Python 3, but the parser it used was
so bad that it barely worked at all. Beautiful Soup 4 works with
Python 3, and since its parser is pluggable, you don't sacrifice
quality.

Special thanks to Thomas Kluyver and Ezio Melotti for getting Python 3
support to the finish line. Ezio Melotti is also to thank for greatly
improving the HTML parser that comes with Python 3.2.

=== CDATA sections are normal text, if they're understood at all. ===

Currently, the lxml and html5lib HTML parsers ignore CDATA sections in
markup:

 <p><![CDATA[foo]]></p> => <p></p>

A future version of html5lib will turn CDATA sections into text nodes,
but only within tags like <svg> and <math>:

 <svg><![CDATA[foo]]></svg> => <svg>foo</svg>

The default XML parser (which uses lxml behind the scenes) turns CDATA
sections into ordinary text elements:

 <p><![CDATA[foo]]></p> => <p>foo</p>

In theory it's possible to preserve the CDATA sections when using the
XML parser, but I don't see how to get it to work in practice.

=== Miscellaneous other stuff ===

If the BeautifulSoup instance has .is_xml set to True, an appropriate
XML declaration will be emitted when the tree is transformed into a
string:

 <?xml version="1.0" encoding="utf-8"?>
 <markup>
  ...
 </markup>

The ['lxml', 'xml'] tree builder sets .is_xml to True; the other tree
builders set it to False. If you want to parse XHTML with an HTML
parser, you can set it manually.


= 3.2.0 =

The 3.1 series wasn't very useful, so I renamed the 3.0 series to 3.2
to make it obvious which one you should use.
559*4882a593Smuzhiyun 560*4882a593Smuzhiyun= 3.1.0 = 561*4882a593Smuzhiyun 562*4882a593SmuzhiyunA hybrid version that supports 2.4 and can be automatically converted 563*4882a593Smuzhiyunto run under Python 3.0. There are three backwards-incompatible 564*4882a593Smuzhiyunchanges you should be aware of, but no new features or deliberate 565*4882a593Smuzhiyunbehavior changes. 566*4882a593Smuzhiyun 567*4882a593Smuzhiyun1. str() may no longer do what you want. This is because the meaning 568*4882a593Smuzhiyunof str() inverts between Python 2 and 3; in Python 2 it gives you a 569*4882a593Smuzhiyunbyte string, in Python 3 it gives you a Unicode string. 570*4882a593Smuzhiyun 571*4882a593SmuzhiyunThe effect of this is that you can't pass an encoding to .__str__ 572*4882a593Smuzhiyunanymore. Use encode() to get a string and decode() to get Unicode, and 573*4882a593Smuzhiyunyou'll be ready (well, readier) for Python 3. 574*4882a593Smuzhiyun 575*4882a593Smuzhiyun2. Beautiful Soup is now based on HTMLParser rather than SGMLParser, 576*4882a593Smuzhiyunwhich is gone in Python 3. There's some bad HTML that SGMLParser 577*4882a593Smuzhiyunhandled but HTMLParser doesn't, usually to do with attribute values 578*4882a593Smuzhiyunthat aren't closed or have brackets inside them: 579*4882a593Smuzhiyun 580*4882a593Smuzhiyun <a href="foo</a>, </a><a href="bar">baz</a> 581*4882a593Smuzhiyun <a b="<a>">', '<a b="<a>"></a><a>"></a> 582*4882a593Smuzhiyun 583*4882a593SmuzhiyunA later version of Beautiful Soup will allow you to plug in different 584*4882a593Smuzhiyunparsers to make tradeoffs between speed and the ability to handle bad 585*4882a593SmuzhiyunHTML. 586*4882a593Smuzhiyun 587*4882a593Smuzhiyun3. In Python 3 (but not Python 2), HTMLParser converts entities within 588*4882a593Smuzhiyunattributes to the corresponding Unicode characters. In Python 2 it's 589*4882a593Smuzhiyunpossible to parse this string and leave the é intact. 

  <a href="http://crummy.com?sacr&eacute;&bleu">

In Python 3, the &eacute; is always converted to \xe9 during
parsing.


= 3.0.7a =

Added an import that makes BS work in Python 2.3.


= 3.0.7 =

Fixed a UnicodeDecodeError when unpickling documents that contain
non-ASCII characters.

Fixed a TypeError that occurred in some circumstances when a tag
contained no text.

Jump through hoops to avoid the use of chardet, which can be extremely
slow in some circumstances. UTF-8 documents should never trigger the
use of chardet.

Whitespace is preserved inside <pre> and <textarea> tags that contain
nothing but whitespace.

Beautiful Soup can now parse a doctype that's scoped to an XML namespace.


= 3.0.6 =

Got rid of a very old debug line that prevented chardet from working.

Added a Tag.decompose() method that completely disconnects a tree or a
subset of a tree, breaking it up into bite-sized pieces that are
easy for the garbage collector to collect.

Tag.extract() now returns the tag that was extracted.

Tag.findNext() now does something with the keyword arguments you pass
it instead of dropping them on the floor.

Fixed a Unicode conversion bug.

Fixed a bug that garbled some <meta> tags when rewriting them.


= 3.0.5 =

Soup objects can now be pickled, and copied with copy.deepcopy.

Tag.append now works properly on existing BS objects. (It wasn't
originally intended for outside use, but it can be now.) (Giles
Radford)

Passing in a nonexistent encoding will no longer crash the parser on
Python 2.4 (John Nagle).

Fixed an underlying bug in SGMLParser that thinks ASCII has 255
characters instead of 127 (John Nagle).

Entities are converted more consistently to Unicode characters.

Entity references in attribute values are now converted to Unicode
characters when appropriate. Numeric entities are always converted,
because SGMLParser always converts them outside of attribute values.

ALL_ENTITIES happens to just be the XHTML entities, so I renamed it to
XHTML_ENTITIES.

The regular expression for bare ampersands was too loose. In some
cases ampersands were not being escaped. (Sam Ruby?)

Non-breaking spaces and other special Unicode space characters are no
longer folded to ASCII spaces. (Robert Leftwich)

Information inside a TEXTAREA tag is now parsed literally, not as HTML
tags.
TEXTAREA now works exactly the same way as SCRIPT. (Zephyr Fang)

= 3.0.4 =

Fixed a bug that crashed Unicode conversion in some cases.

Fixed a bug that prevented UnicodeDammit from being used as a
general-purpose data scrubber.

Fixed some unit test failures when running against Python 2.5.

When considering whether to convert smart quotes, UnicodeDammit now
looks at the original encoding in a case-insensitive way.

= 3.0.3 (20060606) =

Beautiful Soup is now usable as a way to clean up invalid XML/HTML (be
sure to pass in an appropriate value for convertEntities, or XML/HTML
entities might stick around that aren't valid in HTML/XML). The result
may not validate, but it should be good enough to not choke a
real-world XML parser. Specifically, the output of a properly
constructed soup object should always be valid as part of an XML
document, but parts may be missing if they were missing in the
original. As always, if the input is valid XML, the output will also
be valid.

= 3.0.2 (20060602) =

Previously, Beautiful Soup correctly handled attribute values that
contained embedded quotes (sometimes by escaping), but not other kinds
of XML character. Now, it correctly handles or escapes all special XML
characters in attribute values.
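
To illustrate what escaping special XML characters in an attribute
value looks like, here is a minimal sketch built on Python's standard
xml.sax.saxutils.quoteattr. It is not Beautiful Soup's own code, and
render_attribute is a hypothetical helper name used only for this
example.

```python
from xml.sax.saxutils import quoteattr


def render_attribute(name, value):
    """Render name=value with &, <, > and embedded quotes made XML-safe.

    Hypothetical helper for illustration; quoteattr picks the quote
    character and escapes whatever it has to.
    """
    return "%s=%s" % (name, quoteattr(value))


if __name__ == "__main__":
    print(render_attribute("title", 'Tom & "Jerry" <3'))
    print(render_attribute("alt", "both \" and ' quotes"))
```

quoteattr wraps the value in single quotes when it contains only
double quotes, and falls back to &quot; when both kinds appear, so the
result can always be dropped into a tag as-is.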

I aliased methods to the 2.x names (fetch, find, findText, etc.) for
backwards compatibility purposes. Those names are deprecated and if I
ever do a 4.0 I will remove them. I will, I tell you!

Fixed a bug where the findAll method wasn't passing along any keyword
arguments.

When run from the command line, Beautiful Soup now acts as an HTML
pretty-printer, not an XML pretty-printer.

= 3.0.1 (20060530) =

Reintroduced the "fetch by CSS class" shortcut. I thought keyword
arguments would replace it, but they don't. You can't call soup('a',
class='foo') because class is a Python keyword.

If Beautiful Soup encounters a meta tag that declares the encoding,
but a SoupStrainer tells it not to parse that tag, Beautiful Soup will
no longer try to rewrite the meta tag to mention the new
encoding. Basically, this makes SoupStrainers work in real-world
applications instead of crashing the parser.

= 3.0.0 "Who would not give all else for two p" (20060528) =

This release is not backward-compatible with previous releases. If
you've got code written with a previous version of the library, go
ahead and keep using it, unless one of the features mentioned here
really makes your life easier. Since the library is self-contained,
you can include an old copy of the library in your old applications,
and use the new version for everything else.

The documentation has been rewritten and greatly expanded with many
more examples.

Beautiful Soup autodetects the encoding of a document (or uses the one
you specify), and converts it from its native encoding to
Unicode. Internally, it only deals with Unicode strings. When you
print out the document, it converts to UTF-8 (or another encoding you
specify). [Doc reference]

It's now easy to make large-scale changes to the parse tree without
screwing up the navigation members. The methods are extract,
replaceWith, and insert. [Doc reference. See also Improving Memory
Usage with extract]

Passing True in as an attribute value gives you tags that have any
value for that attribute. You don't have to create a regular
expression. Passing None for an attribute value gives you tags that
don't have that attribute at all.

Tag objects now know whether or not they're self-closing. This avoids
the problem where Beautiful Soup thought that tags like <BR /> were
self-closing even in XML documents. You can customize the self-closing
tags for a parser object by passing them in as a list of
selfClosingTags: you don't have to subclass anymore.

There's a new built-in parser, MinimalSoup, which has most of
BeautifulSoup's HTML-specific rules, but no tag nesting rules.
[Doc reference]

You can use a SoupStrainer to tell Beautiful Soup to parse only part
of a document. This saves time and memory, often making Beautiful Soup
about as fast as a custom-built SGMLParser subclass. [Doc reference,
SoupStrainer reference]

You can (usually) use keyword arguments instead of passing a
dictionary of attributes to a search method. That is, you can replace
soup(attrs={"id" : "5"}) with soup(id="5"). You can still use attrs if
(for instance) you need to find an attribute whose name clashes with
the name of an argument to findAll. [Doc reference: **kwargs attrs]

The method names have changed to the better method names used in
Rubyful Soup. Instead of find methods and fetch methods, there are
only find methods. Instead of a scheme where you can't remember which
method finds one element and which one finds them all, we have find
and findAll. In general, if the method name mentions All or a plural
noun (eg. findNextSiblings), then it finds many elements. Otherwise,
it only finds one element. [Doc reference]

Some of the argument names have been renamed for clarity. For instance
avoidParserProblems is now parserMassage.

Beautiful Soup no longer implements a feed method. You need to pass a
string or a filehandle into the soup constructor, rather than calling
feed after the soup has been created.
There is still a feed method, but it's the
feed method implemented by SGMLParser and calling it will bypass
Beautiful Soup and cause problems.

The NavigableText class has been renamed to NavigableString. There is
no NavigableUnicodeString anymore, because every string inside a
Beautiful Soup parse tree is a Unicode string.

findText and fetchText are gone. Just pass a text argument into find
or findAll.

Null was more trouble than it was worth, so I got rid of it. Anything
that used to return Null now returns None.

Special XML constructs like comments and CDATA now have their own
NavigableString subclasses, instead of being treated as oddly-formed
data. If you parse a document that contains CDATA and write it back
out, the CDATA will still be there.

When you're parsing a document, you can get Beautiful Soup to convert
XML or HTML entities into the corresponding Unicode characters. [Doc
reference]

= 2.1.1 (20050918) =

Fixed a serious performance bug in BeautifulStoneSoup which was
causing parsing to be incredibly slow.

Corrected several entities that were previously being incorrectly
translated from Microsoft smart-quote-like characters.

Fixed a bug that was breaking text fetch.

Fixed a bug that crashed the parser when text chunks that look like
HTML tag names showed up within a SCRIPT tag.

THEAD, TBODY, and TFOOT tags are now nestable within TABLE
tags. Nested tables should parse more sensibly now.

BASE is now considered a self-closing tag.

= 2.1.0 "Game, or any other dish?" (20050504) =

Added a wide variety of new search methods which, given a starting
point inside the tree, follow a particular navigation member (like
nextSibling) over and over again, looking for Tag and NavigableText
objects that match certain criteria. The new methods are findNext,
fetchNext, findPrevious, fetchPrevious, findNextSibling,
fetchNextSiblings, findPreviousSibling, fetchPreviousSiblings,
findParent, and fetchParents. All of these use the same basic code
used by first and fetch, so you can pass your weird ways of matching
things into these methods.

The fetch method and its derivatives now accept a limit argument.

You can now pass keyword arguments when calling a Tag object as though
it were a method.

Fixed a bug that caused all hand-created tags to share a single set of
attributes.

= 2.0.3 (20050501) =

Fixed Python 2.2 support for iterators.

Fixed a bug that gave the wrong representation to tags within quote
tags like <script>.

Took some code from Mark Pilgrim that treats CDATA declarations as
data instead of ignoring them.

Beautiful Soup's setup.py will now do an install even if the unit
tests fail. It won't build a source distribution if the unit tests
fail, so I can't release a new version unless they pass.

= 2.0.2 (20050416) =

Added the unit tests in a separate module, and packaged it with
distutils.

Fixed a bug that sometimes caused renderContents() to return a Unicode
string even if there was no Unicode in the original string.

Added the done() method, which closes all of the parser's open
tags. It gets called automatically when you pass in some text to the
constructor of a parser class; otherwise you must call it yourself.

Reinstated some backwards compatibility with 1.x versions: referencing
the string member of a NavigableText object returns the NavigableText
object instead of throwing an error.

= 2.0.1 (20050412) =

Fixed a bug that caused bad results when you tried to reference a tag
name shorter than 3 characters as a member of a Tag, eg. tag.table.td.

Made sure all Tags have the 'hidden' attribute so that an attempt to
access tag.hidden doesn't spawn an attempt to find a tag named
'hidden'.

Fixed a bug in the comparison operator.

= 2.0.0 "Who cares for fish?" (20050410) =

Beautiful Soup version 1 was very useful but also pretty stupid. I
originally wrote it without noticing any of the problems inherent in
trying to build a parse tree out of ambiguous HTML tags. This version
solves all of those problems to my satisfaction. It also adds many new
clever things to make up for the removal of the stupid things.

== Parsing ==

The parser logic has been greatly improved, and the BeautifulSoup
class should much more reliably yield a parse tree that looks like
what the page author intended. For a particular class of odd edge
cases that now causes problems, there is a new class,
ICantBelieveItsBeautifulSoup.

By default, Beautiful Soup now performs some cleanup operations on
text before parsing it. This is to avoid common problems with bad
definitions and self-closing tags that crash SGMLParser. You can
provide your own set of cleanup operations, or turn it off
altogether. The cleanup operations include fixing self-closing tags
that don't close, and replacing Microsoft smart quotes and similar
characters with their HTML entity equivalents.
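
A rough sketch of the kind of pre-parse cleanup described above. This
is not the library's actual rule set: SMART_CHARS, SELF_CLOSING, and
clean_markup are hypothetical names, the table covers only four
characters, and the sketch assumes the markup is a string in which the
Windows-1252 smart-quote bytes survive as the code points \x91
through \x94.

```python
import re

# Assumed, partial table: Windows-1252 smart quotes -> HTML entities.
SMART_CHARS = {
    "\x91": "&lsquo;",
    "\x92": "&rsquo;",
    "\x93": "&ldquo;",
    "\x94": "&rdquo;",
}

# Matches a tag that ends in "/>" with or without whitespace before it.
SELF_CLOSING = re.compile(r"(<[^<>]*?)\s*/>")


def clean_markup(markup):
    """Normalize self-closing tags and swap smart quotes for entities."""
    # "<br/>" and "<br   />" both become "<br />".
    markup = SELF_CLOSING.sub(r"\1 />", markup)
    for char, entity in SMART_CHARS.items():
        markup = markup.replace(char, entity)
    return markup


if __name__ == "__main__":
    print(clean_markup("<p>\x93Soup\x94<br/></p>"))
```

Keeping the rules as data (a table and a pattern) is what makes it
plausible to swap in your own set of cleanup operations or to skip
them altogether.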

You can now get a pretty-print version of parsed HTML to get a visual
picture of how Beautiful Soup parses it, with the Tag.prettify()
method.

== Strings and Unicode ==

There are separate NavigableText subclasses for ASCII and Unicode
strings. These classes directly subclass the corresponding base data
types. This means you can treat NavigableText objects as strings
instead of having to call methods on them to get the strings.

str() on a Tag always returns a string, and unicode() always returns
Unicode. Previously it was inconsistent.

== Tree traversal ==

In a first() or fetch() call, the tag name or the desired value of an
attribute can now be any of the following:

 * A string (matches that specific tag or that specific attribute value)
 * A list of strings (matches any tag or attribute value in the list)
 * A compiled regular expression object (matches any tag or attribute
   value that matches the regular expression)
 * A callable object that takes the Tag object or attribute value as a
   string. It returns None/false/empty string if the given string
   doesn't match, and any other value if it does.

This is much easier to use than SQL-style wildcards (see, regular
expressions are good for something). Because of this, I took out
SQL-style wildcards.
I'll put them back if someone complains, but
their removal simplifies the code a lot.

You can use fetch() and first() to search for text in the parse tree,
not just tags. There are new alias methods fetchText() and firstText()
designed for this purpose. As with searching for tags, you can pass in
a string, a regular expression object, or a method to match your text.

If you pass in something besides a map to the attrs argument of
fetch() or first(), Beautiful Soup will assume you want to match that
thing against the "class" attribute. When you're scraping
well-structured HTML, this makes your code a lot cleaner.

1.x and 2.x both let you call a Tag object as a shorthand for
fetch(). For instance, foo("bar") is a shorthand for
foo.fetch("bar"). In 2.x, you can also access a specially-named member
of a Tag object as a shorthand for first(). For instance, foo.barTag
is a shorthand for foo.first("bar"). By chaining these shortcuts you
can traverse a tree in very little code: for header in
soup.bodyTag.pTag.tableTag('th'):

If an element relationship (like parent or next) doesn't apply to a
tag, it'll now show up Null instead of None. first() will also return
Null if you ask it for a nonexistent tag. Null is an object that's
just like None, except you can do whatever you want to it and it'll
give you Null instead of throwing an error.
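
The Null behavior described above is an instance of the null-object
pattern. Here is a minimal sketch of how such an object could be
written; it is an illustration, not the code Beautiful Soup 2.0
shipped, and NullType is a hypothetical class name.

```python
class NullType(object):
    """A None-like singleton that absorbs whatever is done to it."""

    _instance = None

    def __new__(cls):
        # Always hand back the same object so "x is Null" works.
        if cls._instance is None:
            cls._instance = super(NullType, cls).__new__(cls)
        return cls._instance

    def __getattr__(self, name):
        # Any missing attribute (headTag, titleTag, ...) is Null too.
        return self

    def __call__(self, *args, **kwargs):
        return self

    def __getitem__(self, key):
        return self

    def __bool__(self):
        # Falsy, like None.
        return False

    __nonzero__ = __bool__  # Python 2 spelling of __bool__

    def __repr__(self):
        return "Null"


Null = NullType()

if __name__ == "__main__":
    # Chained access never raises AttributeError.
    print(Null.headTag.titleTag)
    print(bool(Null.first("title")))
```

Because every operation returns the same falsy singleton, a chain of
lookups can be tested once at the end instead of after every step.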

This lets you do tree traversals like soup.htmlTag.headTag.titleTag
without having to worry if the intermediate stages are actually
there. Previously, if there was no 'head' tag in the document, headTag
in that instance would have been None, and accessing its 'titleTag'
member would have thrown an AttributeError. Now, you can get what you
want when it exists, and get Null when it doesn't, without having to
do a lot of conditionals checking to see if every stage is None.

There are two new relations between page elements: previousSibling and
nextSibling. They reference the previous and next element at the same
level of the parse tree. For instance, if you have HTML like this:

  <p><ul><li>Foo<br /><li>Bar</ul>

The first 'li' tag has a previousSibling of Null and its nextSibling
is the second 'li' tag. The second 'li' tag has a nextSibling of Null
and its previousSibling is the first 'li' tag. The previousSibling of
the 'ul' tag is the first 'p' tag. The nextSibling of 'Foo' is the
'br' tag.

I took out the ability to use fetch() to find tags that have a
specific list of contents. See, I can't even explain it well. It was
really difficult to use, I never used it, and I don't think anyone
else ever used it. To the extent anyone did, they can probably use
fetchText() instead. If it turns out someone needs it I'll think of
another solution.

== Tree manipulation ==

You can add new attributes to a tag, and delete attributes from a
tag. In 1.x you could only change a tag's existing attributes.

== Porting Considerations ==

There are three changes in 2.0 that break old code:

In the post-1.2 release you could pass a function into fetch(). The
function took a string, the tag name. In 2.0, the function takes the
actual Tag object.

It's no longer possible to pass in SQL-style wildcards to fetch(). Use
a regular expression instead.

The different parsing algorithm means the parse tree may not be shaped
like you expect. This will only actually affect you if your code uses
one of the affected parts. I haven't run into this problem yet while
porting my code.

= Between 1.2 and 2.0 =

This is the release to get if you want Python 1.5 compatibility.

The desired value of an attribute can now be any of the following:

 * A string
 * A string with SQL-style wildcards
 * A compiled RE object
 * A callable that returns None/false/empty string if the given value
   doesn't match, and any other value otherwise.

This is much easier to use than SQL-style wildcards (see, regular
expressions are good for something).
Because of this, I no longer
recommend you use SQL-style wildcards. They may go away in a future
release to clean up the code.

Made Beautiful Soup handle processing instructions as text instead of
ignoring them.

Applied patch from Richie Hindle (richie at entrian dot com) that
makes tag.string a shorthand for tag.contents[0].string when the tag
has only one string-owning child.

Added still more nestable tags. The nestable tags thing won't work in
a lot of cases and needs to be rethought.

Fixed an edge case where searching for "%foo" would match any string
shorter than "foo".

= 1.2 "Who for such dainties would not stoop?" (20040708) =

Applied patch from Ben Last (ben at benlast dot com) that made
Tag.renderContents() correctly handle Unicode.

Made BeautifulStoneSoup even dumber by making it not implicitly close
a tag when another tag of the same type is encountered; only when an
actual closing tag is encountered. This change courtesy of Fuzzy (mike
at pcblokes dot com). BeautifulSoup still works as before.

= 1.1 "Swimming in a hot tureen" =

Added more 'nestable' tags. Changed popping semantics so that when a
nestable tag is encountered, tags are popped up to the previously
encountered nestable tag (of whatever kind).
I will revert this if
enough people complain, but it should make more people's lives easier
than harder. This enhancement was suggested by Anthony Baxter (anthony
at interlink dot com dot au).

= 1.0 "So rich and green" (20040420) =

Initial release.