...
Just my blog

A blog about everything, mostly about tech stuff I've made. Here is the list of stuff I use on my blog. Feel free to ask me about the implementations.

Python HTMLParser and Vkontakte randomizer

I've finally finished my first "program" in Python. The task is to parse people's ids from the web page where the reposters' ids are stored. The main problems were:

  • the page code loads dynamically, so there is no simple way to get the ids from it; the best solution was to save the section where the ids are stored to an .html file
  • I wanted to catch id + nickname, but a list of pairs was not a good choice once random came into play
  • I couldn't build a list that stores all found ids; it was wiped on every iteration
  • some nicknames had unsupported characters, and they broke the iteration
  • I got a lot of junk while scanning the .html, so I used regex to filter it out
  • I couldn't add different ids to the list without adding one id to it over and over, and guess what? Yes, that broke the iteration too

What I've learned: below is a huge list of topics, kept here for indexing and later search. [su_spoiler title="List of topics"] (let Google parse it, so you can find this in the future)

  • How to open file in Python
  • How to make global variables in Python
  • How to parse html in Python
  • How to sort variable with regex in Python
    • re.findall in Python
    • re.match in Python
  • Construction 'for' in python
    • How to make a replace for character in Python
  • Construction 'if' in Python
  • Construction 'else' in Python
  • What is string in Python
  • What is list in Python
  • How to add something to list in Python
    • list.append in Python
    • list.extend in Python
    • list.insert in Python
  • How to export data to csv in Python
  • How to get random in Python
  • How to print in Python
  • How to remove unprintable symbols in Python
  • Convert string to list in Python

[/su_spoiler] [su_quote]I will show you some of my drafts. They may not look clear or readable, but please don't blame me: I started from nothing and didn't read any guide like 'Python for gentlemen', so my code can look rough.[/su_quote] Here is my 'most last last try', where all the topics are present:

with open('test.html', 'r', encoding='utf-8') as content_file:
    read_data = content_file.read()
'''
1. Replaced error with charset by replace character
2. Change the way how print was formatted
2.1 Added random - but still not used
3. Added CSV export tool
'''
from html.parser import HTMLParser
import re, sys, random, csv

'''
Global variables here
global vk_read
global vk_name
global men
'''

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        vk_id = str(attrs)
        for line in vk_id:
            vk = re.findall(r'/\S+$', vk_id)
        vk_fnd = str(vk)
        if re.search(r'/\w+\'\)\]', vk_fnd):
            global vk_read
            vk_read = vk_fnd
            for ch in ['/', ')', '[', ']', '"', "'"]:
                if ch in vk_read:
                    vk_read = vk_read.replace(ch, "")
                else:
                    pass
    def handle_data(self, data):
        global vk_name
        vk_name = str(data)
        for line in vk_name:
            if re.match(r'\S+\s+\S+$', vk_name):
                for ch in ['\u0456', '\u0406']:
                    if ch in vk_name:
                        vk_name = vk_name.replace(ch, "?")
                if vk_name:
                    if vk_read:
                        global men
                        men = '@'+vk_read+' '+vk_name
                        print(men)
                        # men = list('@'+vk_read+' '+vk_name)
                        # men_list = men.split()
                        # men_list.append(men_list)
                        with open('vk_winners.csv', 'w', encoding='utf-8', newline='') as csvfile:
                            write = csv.writer(csvfile, delimiter=' ')
                            for _ in men:
                                write.writerow([men])
                                break
                        break
                    else:
                        print('ERROR no id found')
                else:
                    print('ERROR no name found')
            else:
                break
parser = MyHTMLParser()
parser.feed(read_data)

So you can see what I'm trying to do and how. At the end of this post you'll see the final working version. Let's dive into the topics one by one:

How to open file in Python:

with open('test.html', 'r', encoding='utf-8') as content_file:
    read_data = content_file.read()
# the with block has already closed the file at this point
parser = MyHTMLParser()
parser.feed(read_data)

This construction opens the file for reading. Without an explicit encoding it can produce decoding errors, so I added encoding='utf-8' to protect against them. Usually you have to close a file yourself, but the with block closes it automatically as soon as the block ends, so no close() call is needed.
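A minimal sketch of reading a file inside a with block. The sample file and its path are made up here only so the snippet runs on its own; they are not part of my script.

```python
import os
import tempfile

# Hypothetical sample file, created only so this sketch is self-contained.
sample_path = os.path.join(tempfile.mkdtemp(), 'sample.html')
with open(sample_path, 'w', encoding='utf-8') as out_file:
    out_file.write('<a href="/dimka_keystin">Димон Димоныч</a>')

# Read the whole file as text, decoding it as UTF-8.
with open(sample_path, 'r', encoding='utf-8') as content_file:
    read_data = content_file.read()

# Outside the with block the file object still exists, but it is closed.
print(content_file.closed)
```

The `closed` attribute only reports the state of the file; it does not close anything by itself.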

How to make global variables in Python

global vk_read

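Declare the variable with 'global' inside the function, before you assign any value to it; otherwise the assignment creates a new local variable. At the top level of the script a 'global' statement does nothing.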
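A small sketch of what 'global' changes. The function names are hypothetical and exist only for this illustration.

```python
vk_read = ''  # module-level variable


def remember_id(new_id):
    # Without this declaration the assignment below would create a local variable.
    global vk_read
    vk_read = new_id


def forget_to_declare(new_id):
    # No 'global' here, so this vk_read is local and disappears on return.
    vk_read = new_id


remember_id('dimka_keystin')
forget_to_declare('yana_lyubchenko')
print(vk_read)
```

Only the function with the 'global' declaration changes the module-level value.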

How to parse html in Python

In Python 3.4 you use the HTMLParser class from the html.parser module. It hands you almost every tag from the raw html, so all that's left is to filter them. I filtered them using lists and regex. Do not forget to read all the docs: I struggled a lot because I missed this line from them: 'The attrs argument is a list of (name, value) pairs'.
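Since attrs is a list of (name, value) pairs, the href can also be read from it directly. This is an alternative sketch, not what the code in this post does; the class name and the sample html are assumptions made for the example.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):  # hypothetical name
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # dict() turns the (name, value) pairs into a lookup table.
        href = dict(attrs).get('href')
        if tag == 'a' and href:
            self.hrefs.append(href)


sample = ('<a href="/dimka_keystin"><img src="x.jpg"></a>'
          '<a class="like_img_cont" href="/id168233095">x</a>')
collector = HrefCollector()
collector.feed(sample)
print(collector.hrefs)
```

Keeping the list on the parser instance avoids the need for a global variable.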

How to sort variable with regex in Python

In my case I filter in several steps: re.findall('/\S+$', href), then if re.search('/\w+\'\)\]', id_raw):, and finally if re.match('\S+\s+\S+$', vk_name):

  • re.findall returns every match of the given pattern as a list. It doesn't sort anything: with '/\S+$' applied to href it finds a '/' followed by non-space characters running to the end of the string and keeps it for further processing
    • Before:
      [('href', '/dimka_keystin')]
      [('class', 'like_row_cont inl_bl')]
      [('href', '/yana_lyubchenko'), ('class', 'like_img_cont')]
      [('width', '100'), ('height', '100'), ('src', 'https://pp.vk.me/c625428/v625428926/2f4c9/EGjgXLGiMkg.jpg')]
      []
      [('href', '/yana_lyubchenko')]
      [('class', 'like_row_cont inl_bl')]
      [('href', '/id168233095'), ('class', 'like_img_cont')]
      [('width', '100'), ('height', '100'), ('src', 'https://pp.vk.me/c412728/v412728095/33b3/Q9scL5rbFWM.jpg')]
      
    • After:
      ["//pp.vk.me/c625730/v625730549/2b9fe/nG8MaWjdEeA.jpg')]"]
      []
      ["/dimka_keystin')]"]
      []
      []
      ["//pp.vk.me/c625428/v625428926/2f4c9/EGjgXLGiMkg.jpg')]"]
      []
      ["/yana_lyubchenko')]"]
  • re.search checks whether the pattern occurs anywhere in the previous result; only the entries where it does are processed further
    • Found by pattern:
      ["/dimka_keystin')]"]
      ["/yana_lyubchenko')]"]
    • Then each unneeded character is replaced with an empty string, one character at a time
      ["dimka_keystin')]"]
      ["dimka_keystin']"]
      "dimka_keystin']"]
      "dimka_keystin'"
      dimka_keystin'
      dimka_keystin
      ["yana_lyubchenko')]"]
      ["yana_lyubchenko']"]
      "yana_lyubchenko']"]
      "yana_lyubchenko'"
      yana_lyubchenko'
      yana_lyubchenko
  • re.match keeps only the results that match the given pattern from the start of the string. 'def handle_data(self, data):' receives a lot of empty strings, so I filtered them out and also replaced the unsupported characters, as in the code above
    • Before:
      Димон Димоныч
      
      
      ...
      Яна Любченко
      
      
      ...
    • After:
      Димон Димоныч
      Яна Любченко
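The difference between the three functions can be sketched on one of the strings from the examples above. The shorter pattern '/\w+' is used here only for illustration.

```python
import re

text = "[('href', '/dimka_keystin')]"

# findall returns a list with every match of the pattern.
found = re.findall(r'/\S+$', text)
print(found)

# search returns the first match anywhere in the string, or None.
anywhere = re.search(r'/\w+', text)
print(anywhere.group())

# match succeeds only if the pattern matches at the very start of the string.
at_start = re.match(r'/\w+', text)
print(at_start)

# The name filter: two non-space chunks separated by whitespace.
name_ok = re.match(r'\S+\s+\S+$', 'Димон Димоныч')
empty_ok = re.match(r'\S+\s+\S+$', '')
print(bool(name_ok), bool(empty_ok))
```

Writing the patterns as raw strings (the r prefix) keeps the backslashes from being treated as string escapes.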

Construction 'for' in python

for ch in ['\u0456', '\u0406']:
    if ch in vk_name:
        vk_name = vk_name.replace(ch, "?")

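A 'for' loop runs an action for each item of the given value (each element of a list, each character of a string, each line of a file) until the items run out. You can leave it early with 'break', for example once 'something' is found in 'something2'.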
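A short sketch of what 'for' iterates over. Note that looping over a string gives single characters, not lines. The sample values are made up for the example.

```python
ids = ['dimka_keystin', 'yana_lyubchenko', 'id168233095']

# Looping over a list visits each element.
for vk_id in ids:
    print(vk_id)

# Looping over a string visits each character.
letters = []
for ch in 'vk_id':
    letters.append(ch)
print(letters)

# break leaves the loop as soon as the wanted item is found.
first_numeric = None
for vk_id in ids:
    if vk_id.startswith('id'):
        first_numeric = vk_id
        break
print(first_numeric)
```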

Construction 'if' in Python

#THIS
for ch in ['\u0456', '\u0406']:
    if ch in vk_name:
        vk_name = vk_name.replace(ch, "?")

#OR THIS
if re.search('/\w+\'\)\]', vk_fnd):

#OR THIS
if vk_read not in vk_ids:
    vk_ids.append(vk_read)

'if' runs an action when something is true: when a pattern is found, when a value is not in a list, and so on. 'elif' is one more 'if' that is checked only when the previous 'if' was false, and it can test a different condition.

Construction 'else' in Python

'else' covers the opposite case: its block runs when none of the conditions above was true, for example when nothing was found or the value is not in the list.
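The three branches together, as a minimal sketch. The function name and the labels it returns are hypothetical.

```python
def classify(vk_id, known_ids):
    if not vk_id:
        # Checked first: empty strings are rejected.
        return 'empty'
    elif vk_id in known_ids:
        # Checked only when the 'if' above was false.
        return 'duplicate'
    else:
        # Runs when none of the conditions above was true.
        return 'new'


known = ['dimka_keystin']
print(classify('', known))
print(classify('dimka_keystin', known))
print(classify('yana_lyubchenko', known))
```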

How to make a replace for character in Python

and

How to remove unprintable symbols in Python

Simple example:

Remove every character from ['/', ')', '[', ']', '"', "'"] wherever it is found in '["/dimka_keystin')]"]':

for ch in ['/', ')', '[', ']', '"', "'"]:
    if ch in vk_read:
        vk_read = vk_read.replace(ch, "")

Replace the unsupported characters in the list of names:

    for ch in ['\u0456', '\u0406']:
        if ch in vk_name:
            vk_name = vk_name.replace(ch, "?")
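The same two jobs can also be done without a loop. This is a sketch of a different technique (str.translate and encode with errors='replace'), not the code I used. The helper name is hypothetical, and 'ascii' is only a stand-in for whatever encoding the output has to fit into.

```python
raw = '''["/dimka_keystin')]"]'''

# str.maketrans with a third argument builds a table that deletes those characters.
junk = str.maketrans('', '', '/)[]"\'')
clean = raw.translate(junk)
print(clean)


def replace_unencodable(text, encoding):
    # Characters the target encoding cannot represent become '?'.
    return text.encode(encoding, errors='replace').decode(encoding)


safe = replace_unencodable('Mar\u0456ya', 'ascii')
print(safe)
```

With this approach there is no need to list every problem character by hand; the encoding decides which ones get replaced.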

How to add something to list in Python

I found several ways while working on this, but the best fit for my example is list.append(): it adds a value to the end of the list, so it can collect every found value, which is what this task needs.

if vk_read not in vk_ids:
    vk_ids.append(vk_read)

list.insert(i, x) puts a value at any position in the list; it doesn't erase what was stored there but shifts it to the right, which makes it slower on long lists. list.extend(L) adds every item of L to the list one by one; given a string it adds each character separately, while passing a list to append() nests a list inside the list. Neither is useful for my example, because random would then pick something I don't need.
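The three methods side by side, as a quick sketch with made-up values:

```python
ids = ['dimka_keystin']

# append adds one item to the end.
ids.append('yana_lyubchenko')

# insert puts the item at index 0 and shifts the rest to the right.
ids.insert(0, 'id168233095')
print(ids)

# extend adds every item of an iterable; a string is an iterable of characters.
chars = []
chars.extend('abc')
print(chars)

# append with a list as the argument nests that list as a single item.
nested = []
nested.append(['x', 'y'])
print(nested)
```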

How to export data to csv in Python

In my example I just put the needed result into a variable and then write it to the file. Here I take 'random_id' from the id list 'vk_ids', but any data from any variable can be exported the same way. The brackets [] in write.writerow([random_id]) make it a one-item list, so the whole id lands in a single cell; without them writerow would treat the string as a sequence of characters. To export all found ids, call writerow([vk_id]) once per id; writerow(vk_ids) would put them all into one row.

random_id = random.choice(vk_ids)
with open('vk_winners.csv', 'w', encoding='utf-8', newline='') as csvfile:
    write = csv.writer(csvfile, delimiter=' ')
    write.writerow([random_id])

I don't close the file here either, because the with block closes it automatically.
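A sketch of exporting every id, one per row. The ids, the header row and the temporary path are assumptions for this example only.

```python
import csv
import os
import tempfile

vk_ids = ['dimka_keystin', 'yana_lyubchenko', 'id168233095']

# Hypothetical output path in a temporary directory.
csv_path = os.path.join(tempfile.mkdtemp(), 'vk_ids.csv')

# newline='' lets the csv module control the line endings itself.
with open(csv_path, 'w', encoding='utf-8', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['vk_id'])      # header row
    for vk_id in vk_ids:
        writer.writerow([vk_id])    # one id per row

# Read the file back to check what was written.
with open(csv_path, 'r', encoding='utf-8', newline='') as csvfile:
    rows = list(csv.reader(csvfile))
print(rows)
```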

How to get random in Python

As described above, just pass the variable holding the list to 'random.choice()'.

if vk_read not in vk_ids:
    vk_ids.append(vk_read)
random_id = random.choice(vk_ids)
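A small sketch of picking winners from a list. The seed is an assumption added only so the draw can be repeated; a real draw would normally go without it.

```python
import random

vk_ids = ['dimka_keystin', 'yana_lyubchenko', 'id168233095']

# A separate generator with a fixed seed gives the same draw on every run.
rng = random.Random(42)

# choice picks one item; it raises IndexError on an empty list.
winner = rng.choice(vk_ids)

# sample picks several different items at once.
top_two = rng.sample(vk_ids, 2)

print(winner, top_two)
```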

Convert string to list in Python

In my case I just need to create a variable holding an empty list before the parsing starts and then add the string found in each iteration to it.

vk_ids = []  # created once, before parsing starts

# then, for every id found:
if vk_read not in vk_ids:
    vk_ids.append(vk_read)

random_id = random.choice(vk_ids)
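For completeness, a sketch of the direct ways to turn one string into a list. The sample line follows the '@id name' format printed by my draft.

```python
line = '@dimka_keystin Димон Димоныч'

# split() cuts the string on whitespace.
words = line.split()

# list() breaks a string into single characters.
chars = list('vk')

# Brackets wrap the whole string as one list item.
single = [line]

# split with a limit of 1 keeps the full name together.
vk_id, name = line.split(' ', 1)

print(len(words), chars, len(single))
```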

That's all for now, folks, I need to go. Thanks for watching! This is how I finished it:

'''
1. Replaced error with charset by replace character
2. Change the way how print was formatted
2.1 Added random - used range from list of ids
3. Added CSV export tool for one man
'''

from html.parser import HTMLParser
import re, random, csv

with open('test.html', 'r', encoding='utf-8') as content_file:
    read_data = content_file.read()

vk_ids = []  # every unique id found on the page

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, e.g. [('href', '/dimka_keystin')]
        href = str(attrs)
        id_tag = re.findall(r'/\S+$', href)
        id_raw = str(id_tag)

        # skip tags whose attributes do not end with something like /some_id
        if not re.search(r'/\w+\'\)\]', id_raw):
            return

        # strip the leftover punctuation around the id
        vk_read = id_raw
        for ch in ['/', ')', '[', ']', '"', "'"]:
            vk_read = vk_read.replace(ch, "")

        # http://stackoverflow.com/questions/30328193/python-add-string-to-a-list-loop
        if vk_read not in vk_ids:
            vk_ids.append(vk_read)

parser = MyHTMLParser()
parser.feed(read_data)

# pick one winner after the whole page is parsed and export it
if vk_ids:
    random_id = random.choice(vk_ids)
    with open('vk_winners.csv', 'w', encoding='utf-8', newline='') as csvfile:
        write = csv.writer(csvfile, delimiter=' ')
        write.writerow([random_id])
else:
    print('ERROR no id found')