The Spring Festival travel rush is here again, so here's an updated version of my scalper-catching script.
I finally managed to sort out my train tickets (not through Kuxun or a scalper, of course), and rewrote last year's scalper-catching script for any programmers still waiting to buy theirs. I call it a scalper catcher, but naturally it also picks up ordinary ticket transfers. It still works by polling pages on the Kuxun site, with a few new features added:
- Replaced SGMLParser with regular expressions from the re module, for better performance
- It can poll multiple URLs now; for example, either Ji'an or Jinggangshan works for me, so I iterate over both addresses
- It prints the redirect links straight to the screen
- Package-level Python 3 support; however, because of changes around the re module the regular expressions don't run under Python 3 (see the sketch after this list), and I haven't felt like updating that yet.
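For what it's worth, the breakage is most likely not re itself but the bytes/str split: under Python 3, read() returns bytes, and a str pattern cannot be applied to a bytes object. A minimal sketch of the failure and the usual fix (the sample HTML here is made up):

import re

re_links = re.compile(r'<a.*?href=.*?</a>', re.I)
html = b'<a href="/transfer/1">[transfer] 3 tickets</a>'  # what read() returns under Python 3

# re_links.findall(html)                        # TypeError: str pattern on a bytes-like object
print(re_links.findall(html.decode('utf-8')))   # decode first, then the pattern works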
Kuxun itself has launched a ticket-grabbing tool (秒杀器), but I'm still not comfortable with it: first, it gives no output at all, so who knows whether it can actually grab anything; second, it isn't cross-platform, so for now it can't be used on Mac or Linux.
Patches are welcome. :-)
#!/usr/bin/python
# encoding: utf-8
#
# Catch the yellow cattles script
#
# Author: Xuqing Kuang <xuqingkuang@gmail.com>
# New features:
# * Use regexp instead of SGMLParser for performance.
# * Poll multiple URLs at one time.
# * Print out the redirect URL.
# * Basic package compatibility with Python 3.
# TODO:
# * Use one regexp to split the href and text of a link.
# * Update re usage to be compatible with Python 3.

import time
import re

try:
    import urllib2 as urllib
    HTTPError = urllib.HTTPError
except ImportError:
    # Python 3 compatible
    import urllib.request, urllib.error
    HTTPError = urllib.error.HTTPError

urls = (
    "http://piao.kuxun.cn/beijing-jinggangshan/",
    "http://piao.kuxun.cn/beijing-jian/",
)

keyword = '3张'  # "3 tickets"; only links containing this string are reported
sequence = 60    # seconds to sleep between polls

class TrainTicket(object):
    """Catch the yellow cattle."""

    def __init__(self, urls, keyword, sequence=60):
        self.urls = urls
        self.keyword = keyword
        self.sequence = sequence
        self.cache = []   # URLs already reported, to avoid duplicates
        self.html = ''
        self.links = []
        if hasattr(urllib, 'build_opener'):
            self.opener = urllib.build_opener()
        else:
            # Python 3 compatible
            self.opener = urllib.request.build_opener()
        self.result = []
        self.re_links = re.compile(r'<a.*?href=.*?</a>', re.I)
        # self.re_element = re.compile('', re.I)  # Hardcoded in get_element() below
        self.requests = []
        for url in urls:
            if hasattr(urllib, 'Request'):
                request = urllib.Request(url)
            else:
                # Python 3 compatible
                request = urllib.request.Request(url)
            request.add_header('User-Agent', 'Mozilla/5.0')
            self.requests.append(request)

    def get_page(self, request):
        """Open the page."""
        try:
            # NOTE: under Python 3 this returns bytes, which is why the
            # str patterns below fail; see the sketch above the script.
            self.html = self.opener.open(request).read()
        except HTTPError:
            return False
        return self.html

    def get_links(self, html=''):
        """Process the page, get all of the links."""
        if not html:
            html = self.html
        self.links = self.re_links.findall(html)
        return self.links

    def get_element(self, link=''):
        """
        Process one link generated by self.get_links().
        Return a list of the href and the text.
        """
        # FIXME: no idea how to split the href and text with one regexp,
        # so use two regexps as a temporary solution.
        href = re.findall('(?<=href=").*?(?=")', link)  # Get the href attribute
        if not href:                                    # Handle links without an href
            href = ['']
        text = re.split('(<.*?>)', link)[2]             # Get the text of the <a> tag
        href.append(text)                               # Append to the list
        return href

    def get_ticket(self, request=None):
        """Generate the data structure of tickets for one URL."""
        if not request:
            request = self.requests[0]
        self.get_page(request)
        self.get_links()
        for i, raw_link in enumerate(self.links):
            link = self.get_element(raw_link)
            url = link[0]
            name = link[1]
            if name and name.find(self.keyword) >= 0 and url not in self.cache:
                self.result.append((i, name, url))
                self.cache.append(url)
        return self.result

    def print_tickets(self):
        """Process all of the URLs and print out the ticket information."""
        while 1:
            self.result = []
            try:
                print('Begin retrieve')
                for request in self.requests:
                    print('Begin scan %s' % request.get_full_url())
                    self.get_ticket(request)
                    print('Found %s urls.' % len(self.links))
                for r in self.result:
                    print('Index: %s\nName: %s\nURL: %s\n' % (r[0], r[1], r[2]))
                print('Scan finished, begin sleep %s seconds' % self.sequence)
                time.sleep(self.sequence)
            except KeyboardInterrupt:
                exit()

if __name__ == '__main__':
    tt = TrainTicket(urls, keyword, sequence)
    tt.print_tickets()
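On the FIXME/TODO above: one way to pull out the href and the text in a single pass is a grouped pattern. A rough sketch, untested against Kuxun's real markup, so treat the pattern itself as an assumption:

import re

# Group 1 captures the href value, group 2 the link text; assumes
# double-quoted href attributes, which may not hold everywhere.
re_link_parts = re.compile(r'<a[^>]*href="([^"]*)"[^>]*>(.*?)</a>', re.I | re.S)

m = re_link_parts.search('<a class="list" href="/transfer/1">[转让] 北京-井冈山 硬卧 3张</a>')
if m:
    href, text = m.groups()
    print(href, text)

With something like that, get_element() would collapse to a single search() call returning [href, text], keeping the empty-href fallback for links that don't match.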
Jan 19, 2011 04:23:40 AM
By the way, do I just run python trainticket.py, or does it have to be thrown into Apache?
Jan 19, 2011 04:25:13 AM
@Beetle: Just run it directly; it's not a CGI script.
Once running, the output looks something like this:
$ python ./train.py
Begin retrieve
Begin scan http://piao.kuxun.cn/beijing-jinggangshan/
Found 140 urls.
Index: 78
Name: [转让] 北京-井冈山 硬卧 3张 发车日期:1-26
URL: http://tongji.kuxun.cn/leads.php?channel=huoche&type1=transfer&type2=&cid=huoche.com&lb=&v=2.5&url=http%3A%2F%2Fhuoche.kuxun.cn%2Fleads.php%3Faction%3Dpiao%26method%3DjumpUrl%26url%3DGIZ0OV%252F55Iqm8WnDpn89kpkwxSbwuggmyPWQt%252BB9Bre2Z3PBI%252FxC7ufdZ2SOYzsj77loAkSV2vF%252B0OzLne0O78EDSUQgZMMR5QsZYlJ7BSM%253D
Jan 19, 2011 04:31:36 AM
@K*K: It doesn't strictly require Python 3, right? This is all I get when I run it:
Begin retrieve
Begin scan http://piao.kuxun.cn/wuxi-nanjing/
Found 107 urls.
Scan finished, begin sleep 60 seconds
And then it just does the same thing over again...
Jan 19, 2011 04:36:52 AM
Ugh, it was my own fault after all; I had the URL wrong.
Jan 19, 2011 04:48:04 AM
@Beetle: Yeah, the address matters :-)
Python 3 support is still incomplete, so stick to Python 2.5 or later; I'm using Python 2.7 on Fedora 14. Proper Python 3 support will probably land next year, ha~
Jan 19, 2011 04:48:48 AM
@Beetle: For the URL, it's best to browse Kuxun first, find the transfer-ticket page you want, and paste its address straight into the script.
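That is, whatever transfer pages you settle on just go into the urls tuple at the top of the script (these two entries are my own routes; replace them with yours):

# Paste the Kuxun transfer pages you care about here;
# the script polls every entry in turn.
urls = (
    "http://piao.kuxun.cn/beijing-jinggangshan/",
    "http://piao.kuxun.cn/beijing-jian/",
)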
Jan 20, 2011 03:28:14 AM
Running it under Windows here, I get:
Begin retrieve
Begin scan http://huoche.kuxun.cn/zhuanrang-beijing-guangzhou.html
Found 107 urls.
Scan finished, begin sleep 60 seconds
But it doesn't show how many tickets there are or the departure times. Any pointers?
Jan 20, 2011 03:59:22 AM
@cch: Try changing the keyword line to an empty string '' first. That's the search keyword: if the script is working, it will then print out every link, because keyword is the string searched for inside each link's text, and only matching links are printed.
It could also be a Windows compatibility problem. If you can, please fix it and send a patch; I haven't tested Windows compatibility at all. :-)
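For reference, the filter in get_ticket() is a plain substring match, which is why an empty keyword lets everything through:

keyword = '3张'
name = '[转让] 北京-井冈山 硬卧 3张 发车日期:1-26'

print(name.find(keyword) >= 0)  # True: the link text contains the keyword
print(name.find('') >= 0)       # True for any text: an empty keyword matches every link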
Mar 08, 2011 01:47:16 PM
Hard to believe this works... reminds me of that script someone wrote a while back for grabbing BURSTNET hosts...
Mar 23, 2011 05:43:55 PM
@EmiNarcissus: Why not? :-)