Going deeper with Python, or HTMLParser and the Vkontakte randomizer come back!

Hello to everybody who reads this blog. Last time I tried to parse a saved HTML page to get Vkontakte ids and randomly select one of them each time: here and here. Now I'll try to go deeper and use a different way to extract data from a live webpage, without saving it to the folder with the Python script. In my opinion, after some googling, I should use these:
http://docs.python-guide.org/en/latest/scenarios/scrape/
http://stackoverflow.com/questions/2081586/web-scraping-with-python
Later I will add some more KB.
The small plan:
Add the URL of the page to parse to a txt file, then read it from the file into Python and show it in the console after Python asks the user.
Get all found ids and save the list to the file ids.csv, optionally with name + id, or just the id if names will produce…
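The two file steps of the plan can be sketched roughly like this. It is only a minimal sketch under assumptions: the function names `read_page_url` and `save_ids` are made up for illustration, the URL is assumed to sit on the first non-empty line of the txt file, and the ids are assumed to arrive as (id, name) pairs. Downloading the live page is left out here.

```python
import csv
from pathlib import Path


def read_page_url(path):
    """Return the first non-empty line of the text file as the URL."""
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if line.strip():
            return line.strip()
    raise ValueError("no URL found in {}".format(path))


def save_ids(rows, path, with_names=True):
    """Write (vk_id, name) pairs to a CSV file; keep only the id if asked."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for vk_id, name in rows:
            writer.writerow([vk_id, name] if with_names else [vk_id])
```

Calling `save_ids(pairs, "ids.csv", with_names=False)` would give the "just id" variant, which may be the safer choice if some names cause trouble when written out.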

Python HTMLParser and Vkontakte randomizer

Finally I’ve finished my first “program” in Python. The task is to parse people’s ids from the web page where the reposters’ ids are stored. The main problems were:
the web page code loads dynamically, so there is no simple way to get ids from it; the best solution was to save the section where the ids are stored to an .html file;
I wanted to catch id + nickname, but a list of pairs was not a good decision once the random selection came in;
I couldn’t create a list that stores all found ids, it was wiped on every iteration;
some nicknames contained unsupported characters, and they broke the iteration;
I got a lot of junk while scanning the .html file, so I used a regex to filter it out;
I couldn’t add various ids to the list without…
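For the random part, here is one possible way to pick from a list of (id, nickname) pairs. This is a sketch, not necessarily how the program in the post ended up doing it: the helper name `pick_winner` and the sample data are hypothetical. `random.choice` accepts a list of tuples as-is, and creating the list once, outside any loop, is what keeps it from being wiped on every iteration.

```python
import random


def pick_winner(pairs, rng=random):
    """Return one (vk_id, nickname) pair chosen uniformly at random."""
    unique = list(dict.fromkeys(pairs))  # drop repeats, keep first-seen order
    if not unique:
        raise ValueError("no ids collected")
    return rng.choice(unique)


# Hypothetical sample data, not taken from a real page.
found = [("id101", "Ann Lee"), ("id202", "Bob Ray"), ("id101", "Ann Lee")]
vk_id, nickname = pick_winner(found)
print("@{} - {}".format(vk_id, nickname))
```

Removing duplicates first means a person who shows up twice in the saved page does not get a double chance.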

Python HTMLParser

How to spend two days if you know nothing about Python: I need to parse the HTML code of a page where the VK id and username of every person who shared a post are stored.

    from html.parser import HTMLParser
    import re

    # Read the saved page fragment that holds the reposters' links.
    with open('test.html', 'r', encoding='utf-8') as content_file:
        read_data = content_file.read()

    class MyHTMLParser(HTMLParser):
        def __init__(self):
            super().__init__()
            self.vk_read = None  # last id seen in a start tag

        def handle_starttag(self, tag, attrs):
            # Stringify the attribute list and grab the trailing "/id" part.
            vk_fnd = str(re.findall(r'/\S+$', str(attrs)))
            if re.search(r'/\w+\'\)\]', vk_fnd):
                # Strip the list/tuple punctuation left over from str().
                for ch in ['/', ')', '[', ']', '"', "'"]:
                    vk_fnd = vk_fnd.replace(ch, '')
                self.vk_read = vk_fnd

        def handle_data(self, data):
            # A "Name Surname" text node that follows an id is taken as the username.
            if self.vk_read and re.match(r'\S+\s+\S+$', data):
                print('@{0} - {1}'.format(self.vk_read, data))

    parser = MyHTMLParser()
    parser.feed(read_data)

Now I know more.
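With a bit more Python, the same idea can be written without regexes by reading the `href` attribute directly. This is only a sketch: the class name `RepostParser` is made up, and the markup in `sample` is a guess at what the saved fragment might look like (one link per person, with the id in the `href` and the name as the link text), since the real page layout is not shown here.

```python
from html.parser import HTMLParser


class RepostParser(HTMLParser):
    """Collect (vk_id, name) pairs from <a href="/some_id">Name</a> links."""

    def __init__(self):
        super().__init__()
        self.pairs = []        # results survive across callbacks
        self._current = None   # id of the link being read, if any

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href") if tag == "a" else None
        if href and href.startswith("/"):
            self._current = href.lstrip("/")

    def handle_data(self, data):
        name = data.strip()
        if self._current and name:
            self.pairs.append((self._current, name))
            self._current = None

    def handle_endtag(self, tag):
        if tag == "a":
            self._current = None


# Hypothetical markup; the real saved fragment may look different.
sample = '<div><a href="/id101">Ann Lee</a> <a href="/bob_ray">Bob Ray</a></div>'
parser = RepostParser()
parser.feed(sample)
parser.close()
print(parser.pairs)
```

Keeping the results in `self.pairs` instead of global variables means the list is built once per parser and every id found during `feed` ends up in it.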