pyquery
changeset 80:91a4330801b9 0.3.1
merge
| author | Gael Pasgrimaud <gael@gawel.org> |
|---|---|
| date | Sat Jan 24 03:08:56 2009 +0100 (18 months ago) |
| parents | 58b15bae680f 45ac7d97a0ae |
| children | e756c934656d |
| files | pyquery/README.txt pyquery/pyquery.py pyquery/test.py |
--- a/.hgtags	Sat Jan 24 03:00:22 2009 +0100
+++ b/.hgtags	Sat Jan 24 03:08:56 2009 +0100
@@ -1,1 +1,2 @@
 87f002ce396754a04a55b4dc8494f38100957108 0.2
+9796ea9cb849ce66ca29b394b55a265fe2acb332 0.3
--- a/pyquery/README.txt	Sat Jan 24 03:00:22 2009 +0100
+++ b/pyquery/README.txt	Sat Jan 24 03:08:56 2009 +0100
@@ -11,6 +11,18 @@
 
 It can be used for many purposes, one idea that I might try in the future is to
 use it for templating with pure http templates that you modify using pyquery.
+I can also be used for web scrapping or for theming applications with
+`Deliverance`_.
+
+The `project`_ is being actively developped on a mercurial repository on
+Bitbucket. I have the policy of giving push access to anyone who wants it
+and then to review what he does. So if you want to contribute just email me.
+
+The Sphinx documentation is available on `pyquery.org`_.
+
+.. _deliverance: http://www.gawel.org/weblog/en/2008/12/skinning-with-pyquery-and-deliverance
+.. _project: http://www.bitbucket.org/olauzanne/pyquery/
+.. _pyquery.org: http://pyquery.org/
 
 .. contents::
 
@@ -42,7 +54,8 @@
     'you know Python rocks'
 
 You can use some of the pseudo classes that are available in jQuery but that
-are not standard in css such as :first :last :even :odd :eq :lt :gt::
+are not standard in css such as :first :last :even :odd :eq :lt :gt :checked
+:selected :file::
 
     >>> d('p:first')
     [<p#hello.hello>]
@@ -117,63 +130,18 @@
 Traversing
 ----------
 
-Some jQuery traversal methods are supported. For instance, you can filter the selection list
-using a string selector::
+Some jQuery traversal methods are supported. Here are a few examples.
+
+You can filter the selection list using a string selector::
 
     >>> d('p').filter('.hello')
     [<p#hello.hello>]
 
-Filtering can also be done using a function::
-
-    >>> d('p').filter(lambda i: i == 1)
-    [<p#test>]
-
-Filtering functions can refer to the current element as 'this', like in jQuery::
-
-    >>> d('p').filter(lambda i: pq(this).text() == 'you know Python rocks')
-    [<p#hello.hello>]
-
-The opposite of filter is `not_` - it returns the items that don't match the selector::
-
-    >>> d('p').not_('.hello')
-    [<p#test>]
-
-You can map a callable onto a PyQuery and get a mutated result. The result can
-contain any items, not just elements::
-
-    >>> d('p').map(lambda i, e: pq(e).text())
-    ['you know Python rocks', 'hello python !']
-
-Like the filter method, map callbacks can reference the current item as this::
-
-    >>> d('p').map(lambda i, e: len(pq(this).text()))
-    [21, 14]
-
-The map callback can also return a list, which will extend the resulting
-PyQuery::
-
-    >>> d('p').map(lambda i, e: pq(this).text().split())
-    ['you', 'know', 'Python', 'rocks', 'hello', 'python', '!']
-
 It is possible to select a single element with eq::
 
     >>> d('p').eq(0)
     [<p#hello.hello>]
 
-The `is_` method lets you query if any current elements match the selector::
-
-    >>> d('p').eq(0).is_('.hello')
-    True
-    >>> d('p').eq(1).is_('.hello')
-    False
-
-hasClass allows for checking for the presence of a class by name::
-
-    >>> d('p').eq(0).hasClass('hello')
-    True
-    >>> d('p').eq(1).hasClass('hello')
-    False
-
 You can find nested elements::
 
     >>> d('p').find('a')
@@ -331,17 +299,34 @@
 Making links absolute
 ---------------------
 
-You can make all links on a page absolute which can be usefull for screen
-scrapping::
+You can make links absolute which can be usefull for screen scrapping::
 
-    >>> d = pq(url='http://google.com')
-    >>> d('a:last').attr('href')
-    '/intl/fr/privacy.html'
+    >>> d = pq(url='http://www.w3.org/', parser='html')
+    >>> d('a[title="W3C Activities"]').attr('href')
+    '/Consortium/activities'
     >>> d.make_links_absolute()
     [<html>]
-    >>> d('a:last').attr('href')
-    'http://google.com/intl/fr/privacy.html'
+    >>> d('a[title="W3C Activities"]').attr('href')
+    'http://www.w3.org/Consortium/activities'
 
+Using different parsers
+-----------------------
+
+By default pyquery uses the lxml xml parser and then if it doesn't work goes on
+to try the html parser from lxml.html. The xml parser can sometimes be
+problematic when parsing xhtml pages because the parser will not raise an error
+but give an unusable tree (on w3c.org for example).
+
+You can also choose which parser to use explicitly::
+
+    >>> pq('<html><body><p>toto</p></body></html>', parser='xml')
+    [<html>]
+    >>> pq('<html><body><p>toto</p></body></html>', parser='html')
+    [<html>]
+    >>> pq('<html><body><p>toto</p></body></html>', parser='html_fragments')
+    [<p>]
+
+The html and html_fragments parser are the ones from lxml.html.
 
 Testing
 -------
@@ -363,24 +348,28 @@
 
     $ STATIC_DEPS=true bin/buildout
 
-Other documentations
--------------------
+More documentation
+------------------
 
-For more documentation about the API use the jquery website http://docs.jquery.com/
+First there is the Sphinx documentation `here`_.
+Then for more documentation about the API you can use the `jquery website`_.
+The reference I'm now using for the API is ... the `color cheat sheet`_.
+Then you can always look at the `code`_.
 
-The reference I'm now using for the API is ... the color cheat sheet
-http://colorcharge.com/wp-content/uploads/2007/12/jquery12_colorcharge.png
+.. _jquery website: http://docs.jquery.com/
+.. _code: http://www.bitbucket.org/olauzanne/pyquery/src/tip/pyquery/pyquery.py
+.. _here: http://pyquery.org
+.. _color cheat sheet: http://colorcharge.com/wp-content/uploads/2007/12/jquery12_colorcharge.png
 
 TODO
 ----
 
-- SELECTORS: it works fine but missing all the :xxx (:first, :last, ...) can be
-  done by patching lxml.cssselect
+- SELECTORS: still missing some jQuery pseudo classes (:radio, :password, ...)
 - ATTRIBUTES: done
 - CSS: done
 - HTML: done
-- MANIPULATING: did all but the "wrap" methods
-- TRAVERSING: did a few
+- MANIPULATING: missing the wrapAll and wrapInner methods
+- TRAVERSING: about half done
 - EVENTS: nothing to do with server side might be used later for automatic ajax
 - CORE UI EFFECTS: did hide and show the rest doesn't really makes sense on
   server side
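The "Making links absolute" example in the README hunk above comes down to resolving each relative `href` against the page's base URL. The following is a minimal sketch of that one step, not pyquery's own implementation: it uses Python 3's `urllib.parse.urljoin` (the changeset's `pyquery.py` imports the Python 2 equivalent from `urlparse`), and the helper name `absolutize` is hypothetical.

```python
from urllib.parse import urljoin


def absolutize(base_url, hrefs):
    """Resolve each href against base_url.

    Hypothetical helper for illustration only; already-absolute URLs
    are returned unchanged by urljoin.
    """
    return [urljoin(base_url, href) for href in hrefs]


links = absolutize(
    'http://www.w3.org/',
    ['/Consortium/activities', 'http://pyquery.org/'],
)
print(links)
```

The first entry reproduces the W3C link from the README doctest; the second shows that an absolute URL is left alone.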
--- a/pyquery/cssselectpatch.py	Sat Jan 24 03:00:22 2009 +0100
+++ b/pyquery/cssselectpatch.py	Sat Jan 24 03:08:56 2009 +0100
@@ -36,6 +36,36 @@
         xpath.add_post_condition('position() mod 2 = 0')
         return xpath
 
+    def _xpath_checked(self, xpath):
+        """Matches odd elements, zero-indexed.
+        """
+        xpath.add_condition("@checked and name(.) = 'input'")
+        return xpath
+
+    def _xpath_selected(self, xpath):
+        """Matches all elements that are selected.
+        """
+        xpath.add_condition("@selected and name(.) = 'option'")
+        return xpath
+
+    def _xpath_disabled(self, xpath):
+        """Matches all elements that are disabled.
+        """
+        xpath.add_condition("@disabled")
+        return xpath
+
+    def _xpath_enabled(self, xpath):
+        """Matches all elements that are disabled.
+        """
+        xpath.add_condition("not(@disabled) and name(.) = 'input'")
+        return xpath
+
+    def _xpath_file(self, xpath):
+        """Matches all input elements of type file.
+        """
+        xpath.add_condition("@type = 'file' and name(.) = 'input'")
+        return xpath
+
 cssselect.Pseudo = JQueryPseudo
 
 class JQueryFunction(Function):
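To make the five new conditions easier to read, here is a rough sketch that restates each XPath condition from the hunk above as a plain Python predicate over elements. It is an illustration under assumptions, not how pyquery evaluates selectors: it uses the standard library's `xml.etree.ElementTree` instead of lxml, the names `PSEUDO` and `select` are hypothetical, and the markup is a shortened version of the form added in `test.py`.

```python
import xml.etree.ElementTree as ET

# Shortened version of the form used in the changeset's tests.
FORM = """
<form action="/">
  <input name="enabled" type="text" value="test"/>
  <input name="disabled" type="text" value="disabled" disabled="disabled"/>
  <input name="file" type="file"/>
  <select name="select">
    <option value="">Choose something</option>
    <option value="two" selected="selected">Two</option>
  </select>
  <input name="radio" type="radio" value="one"/>
  <input name="radio" type="radio" value="two" checked="checked"/>
</form>
"""

# One predicate per pseudo class, mirroring the XPath conditions in the diff.
PSEUDO = {
    # @checked and name(.) = 'input'
    'checked': lambda el: el.tag == 'input' and 'checked' in el.attrib,
    # @selected and name(.) = 'option'
    'selected': lambda el: el.tag == 'option' and 'selected' in el.attrib,
    # @disabled (any element)
    'disabled': lambda el: 'disabled' in el.attrib,
    # not(@disabled) and name(.) = 'input'
    'enabled': lambda el: el.tag == 'input' and 'disabled' not in el.attrib,
    # @type = 'file' and name(.) = 'input'
    'file': lambda el: el.tag == 'input' and el.get('type') == 'file',
}


def select(root, pseudo):
    """Return every element under root (root included) matching the pseudo class."""
    return [el for el in root.iter() if PSEUDO[pseudo](el)]


root = ET.fromstring(FORM)
counts = {name: len(select(root, name)) for name in PSEUDO}
print(counts)
```

Note that, as in the diff, `:enabled` only ever matches `input` elements, while `:disabled` matches any element carrying the attribute.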
--- a/pyquery/pyquery.py	Sat Jan 24 03:00:22 2009 +0100
+++ b/pyquery/pyquery.py	Sat Jan 24 03:08:56 2009 +0100
@@ -5,16 +5,26 @@
 # Distributed under the BSD license, see LICENSE.txt
 from cssselectpatch import selector_to_xpath
 from lxml import etree
+import lxml.html
 from copy import deepcopy
 from urlparse import urljoin
 
-def fromstring(context):
+def fromstring(context, parser=None):
     """use html parser if we don't have clean xml
     """
-    try:
-        return etree.fromstring(context)
-    except etree.XMLSyntaxError:
-        return etree.fromstring(context, etree.HTMLParser())
+    if parser == None:
+        try:
+            return [etree.fromstring(context)]
+        except etree.XMLSyntaxError:
+            return [lxml.html.fromstring(context)]
+    elif parser == 'xml':
+        return [etree.fromstring(context)]
+    elif parser == 'html':
+        return [lxml.html.fromstring(context)]
+    elif parser == 'html_fragments':
+        return lxml.html.fragments_fromstring(context)
+    else:
+        ValueError('No such parser: "%s"' % parser)
 
 class NoDefault(object):
     def __repr__(self):
@@ -59,6 +69,13 @@
         html = None
         elements = []
         self._base_url = None
+        parser = kwargs.get('parser')
+        if 'parser' in kwargs:
+            del kwargs['parser']
+        if not kwargs and len(args) == 1 and isinstance(args[0], basestring) \
+            and args[0].startswith('http://'):
+            kwargs = {'url': args[0]}
+            args = []
 
         if 'parent' in kwargs:
             self._parent = kwargs.pop('parent')
@@ -76,7 +93,7 @@
                 self._base_url = url
             else:
                 raise ValueError('Invalid keyword arguments %s' % kwargs)
-            elements = [fromstring(html)]
+            elements = fromstring(html, parser)
         else:
             # get nodes
 
@@ -94,7 +111,7 @@
             # get context
             if isinstance(context, basestring):
                 try:
-                    elements = [fromstring(context)]
+                    elements = fromstring(context, parser)
                 except Exception, e:
                     raise ValueError('%r, %s' % (e, context))
             elif isinstance(context, self.__class__):
@@ -164,7 +181,18 @@
     ##############
 
     def filter(self, selector):
-        """Filter elements in self using selector (string or function)."""
+        """Filter elements in self using selector (string or function).
+
+            >>> d = PyQuery('<p class="hello">Hi</p><p>Bye</p>')
+            >>> d('p')
+            [<p.hello>, <p>]
+            >>> d('p').filter('.hello')
+            [<p.hello>]
+            >>> d('p').filter(lambda i: i == 1)
+            [<p>]
+            >>> d('p').filter(lambda i: PyQuery(this).text() == 'Hi')
+            [<p.hello>]
+        """
         if not callable(selector):
             return self.__class__(selector, self, **dict(parent=self))
         else:
@@ -179,16 +207,35 @@
         return self.__class__(elements, **dict(parent=self))
 
     def not_(self, selector):
-        """Return elements that don't match the given selector."""
+        """Return elements that don't match the given selector.
+
+            >>> d = PyQuery('<p class="hello">Hi</p><p>Bye</p><div></div>')
+            >>> d('p').not_('.hello')
+            [<p>]
+        """
         exclude = set(self.__class__(selector, self))
         return self.__class__([e for e in self if e not in exclude], **dict(parent=self))
 
     def is_(self, selector):
-        """Returns True if selector matches at least one current element, else False."""
+        """Returns True if selector matches at least one current element, else False.
+            >>> d = PyQuery('<p class="hello">Hi</p><p>Bye</p><div></div>')
+            >>> d('p').eq(0).is_('.hello')
+            True
+            >>> d('p').eq(1).is_('.hello')
+            False
+        """
         return bool(self.__class__(selector, self))
 
     def find(self, selector):
-        """Find elements using selector traversing down from self."""
+        """Find elements using selector traversing down from self.
+
+            >>> m = '<p><span><em>Whoah!</em></span></p><p><em> there</em></p>'
+            >>> d = PyQuery(m)
+            >>> d('p').find('em')
+            [<em>, <em>]
+            >>> d('p').eq(1).find('em')
+            [<em>]
+        """
         xpath = selector_to_xpath(selector)
         results = [child.xpath(xpath) for tag in self for child in tag.getchildren()]
         # Flatten the results
@@ -198,7 +245,14 @@
         return self.__class__(elements, **dict(parent=self))
 
     def eq(self, index):
-        """Return PyQuery of only the element with the provided index."""
+        """Return PyQuery of only the element with the provided index.
+
+            >>> d = PyQuery('<p class="hello">Hi</p><p>Bye</p><div></div>')
+            >>> d('p').eq(0)
+            [<p.hello>]
+            >>> d('p').eq(1)
+            [<p>]
+        """
         return self.__class__([self[index]], **dict(parent=self))
 
     def each(self, func):
@@ -213,6 +267,16 @@
 
         func should take two arguments - 'index' and 'element'. Elements can
         also be referred to as 'this' inside of func.
+
+            >>> d = PyQuery('<p class="hello">Hi there</p><p>Bye</p><br />')
+            >>> d('p').map(lambda i, e: PyQuery(e).text())
+            ['Hi there', 'Bye']
+
+            >>> d('p').map(lambda i, e: len(PyQuery(this).text()))
+            [8, 3]
+
+            >>> d('p').map(lambda i, e: PyQuery(this).text().split())
+            ['Hi', 'there', 'Bye']
         """
         items = []
         try:
@@ -236,6 +300,13 @@
         return len(self)
 
     def end(self):
+        """Break out of a level of traversal and return to the parent level.
+
+            >>> m = '<p><span><em>Whoah!</em></span></p><p><em> there</em></p>'
+            >>> d = PyQuery(m)
+            >>> d('p').eq(1).find('em').end().end()
+            [<p>, <p>]
+        """
         return self._parent
 
     ##############
@@ -650,7 +721,7 @@
 
         """
         assert isinstance(value, basestring)
-        value = fromstring(value)
+        value = fromstring(value)[0]
         nodes = []
         for tag in self:
             wrapper = deepcopy(value)
@@ -685,7 +756,7 @@
             return self
 
         assert isinstance(value, basestring)
-        value = fromstring(value)
+        value = fromstring(value)[0]
         wrapper = deepcopy(value)
         if not wrapper.getchildren():
             child = wrapper
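The new `fromstring` in the diff above dispatches on a `parser` name, always returns a list, and falls back to a more lenient parser when none is named. Its final branch builds a `ValueError` without raising it, so an unknown parser name returns `None`. The sketch below shows the same dispatch shape with an explicit `raise`. It is an assumption-laden stand-in, not the library's code: it runs on Python 3, swaps lxml for the standard library's `xml.etree.ElementTree`, and the parser names `'document'` and `'fragments'` and the `PARSERS` registry are hypothetical.

```python
import xml.etree.ElementTree as ET


def parse_document(text):
    """Parse one well-formed document; return its root in a one-item list."""
    return [ET.fromstring(text)]


def parse_fragments(text):
    """Parse several top-level elements by wrapping them in a throwaway root."""
    return list(ET.fromstring('<root>%s</root>' % text))


# Hypothetical registry standing in for the diff's 'xml', 'html'
# and 'html_fragments' branches.
PARSERS = {'document': parse_document, 'fragments': parse_fragments}


def fromstring(text, parser=None):
    """Return a list of elements, choosing the parser by name."""
    if parser is None:
        # Default: strict parse first, lenient parse as the fallback.
        try:
            return parse_document(text)
        except ET.ParseError:
            return parse_fragments(text)
    if parser not in PARSERS:
        # Raise explicitly; a bare ValueError(...) expression would be
        # discarded and the function would return None.
        raise ValueError('No such parser: "%s"' % parser)
    return PARSERS[parser](text)


print([el.tag for el in fromstring('<p>a</p><p>b</p>')])
```

Returning a list from every branch is what lets callers such as the wrap methods in the diff take `fromstring(value)[0]` without checking which parser ran.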
--- a/pyquery/test.py	Sat Jan 24 03:00:22 2009 +0100
+++ b/pyquery/test.py	Sat Jan 24 03:08:56 2009 +0100
@@ -79,6 +79,27 @@
            </html>
            """
 
+    html4 = """
+           <html>
+            <body>
+              <form action="/">
+                <input name="enabled" type="text" value="test"/>
+                <input name="disabled" type="text" value="disabled" disabled="disabled"/>
+                <input name="file" type="file" />
+                <select name="select">
+                  <option value="">Choose something</option>
+                  <option value="one">One</option>
+                  <option value="two" selected="selected">Two</option>
+                  <option value="three">Three</option>
+                </select>
+                <input name="radio" type="radio" value="one"/>
+                <input name="radio" type="radio" value="two" checked="checked"/>
+                <input name="radio" type="radio" value="three"/>
+              </form>
+            </body>
+           </html>
+           """
+
     def test_selector_from_doc(self):
         doc = etree.fromstring(self.html)
         assert len(self.klass(doc)) == 1
@@ -118,6 +139,14 @@
         self.assertEqual(e('div:lt(1)').text(), 'node1')
         self.assertEqual(e('div:eq(2)').text(), 'node3')
 
+        #test on the form
+        e = self.klass(self.html4)
+        assert len(e(':disabled')) == 1
+        assert len(e('input:enabled')) == 5
+        assert len(e(':selected')) == 1
+        assert len(e(':checked')) == 1
+        assert len(e(':file')) == 1
+
 class TestTraversal(unittest.TestCase):
     klass = pq
     html = """
