2020-07
8

Past Struggles Have Blossomed

By xrspook @ 9:38:16 Filed under: Lousy Diary

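When I was working through Think Python and its exercises with a grind-it-out, cover-everything approach, an online friend thought I really didn't need to. That was too much like being a student, he said; just learn whatever you need when you need it. But I actually think my way works well, even though at the time I didn't know what else I could have done. If I finish the book but can't do its exercises, can I really say I've read it? To get comfortable with Python I need to reach the level of reflex, and there is no way to get there without practice. I know exactly what state you end up in when you have learned and understood something but never put it into practice.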

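For the first two years of junior high, my English teacher was a former university lecturer who had somehow ended up teaching English at a notorious middle school. He never assigned us homework and never gave quizzes, so over a whole year we saw no exercises apart from the midterm and final exams. True, at the start of term we were handed a workbook along with the textbook, a very thin one, but whether we did it and how much of it we did was left entirely up to us. There was nothing wrong with his lectures: he explained grammar thoroughly and his handwriting was beautiful. But that style of teaching only works for students disciplined enough to drill on their own; for undisciplined students like us it was fatal. I think my English was quite good when I finished primary school, especially listening, because one winter break I was even picked for an intensive class somewhere. But those first two years of junior high were wasted. To use something fluently, you still need a great deal of practice no matter how well you have grasped the theory. Of my subjects in the high school entrance exam, English was my lowest score. Once I got into a key high school, the gap between me and my classmates became even clearer, because I had done far fewer exercises than they had! Why did I grind through physics and chemistry workbooks back then but never English ones??? English differs from the sciences. In science, once you grasp the principles you can usually get most of the way; even when a problem combines several ideas, having seen it before mostly affects how fast you react, and the line of reasoning is still there. English has fixed collocations, default exceptions, and slang, and if you haven't run into them you simply can't play. That is exactly where I stumbled. If I could start over, I would do a lot of English exercises and keep a notebook of my mistakes. Put together massive reading, a massive vocabulary, and assorted writing techniques, and you can score high on English exams. That said, I don't care at all about scoring high. Since about high school I have seen English as a tool, where practicality comes first. When you talk with foreigners, they understand you even if your English is far from correct, so a 60 and a 90 make little difference: both get the job done.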

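I do the exercises when learning Python because Think Python actually contains very few worked examples, and it doesn't really cover every topic either. Getting things to click together is something only the exercises achieve. Lately I have been writing scripts to convert my blog's exported data. True, for some specific features I still have to search and learn as I go, but the most basic ways of thinking were stamped into my brain while doing the exercises. Take filtering by tag. Most people's first reaction would be to look for a filtering function: with a single tag keyword, find is enough, but with several you need regular expressions. I started out that way too, but then I found it pulled in some baffling bystanders. I needed exact keyword matches, so I built a list first, and if str in list solved the problem perfectly. With many tags, a dictionary can replace the list, since in on a dictionary is blazing fast. If I had never put in the work on strings, lists, and dictionaries, how could I have come up with this? The exercises did torment me for quite a while, but it was all worth it.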
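To make the difference between substring matching and exact matching concrete, here is a minimal sketch. The tag field, its comma separator, and the function names are assumptions made up for illustration; they are not taken from the actual conversion script.

```python
def has_tag_substring(tag_field, keyword):
    """Naive check: substring search, which also matches longer tags."""
    return tag_field.find(keyword) != -1


def has_tag_exact(tag_field, wanted, sep=","):
    """Exact check: split the field into tags, then test membership."""
    tags = [t.strip() for t in tag_field.split(sep)]
    return any(t in wanted for t in tags)


post_tags = "python3, diary"               # hypothetical tag field of one post
wanted = {"python": True, "excel": True}   # dict keys give fast `in` lookups

print(has_tag_substring(post_tags, "python"))  # the bystander "python3" matches
print(has_tag_exact(post_tags, wanted))        # no tag is exactly "python"
```

The same has_tag_exact call works with a list or a set in place of the dictionary; the dictionary (or a set) mainly pays off when the number of wanted tags grows.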

Past struggles make things smooth for me now.

2020-06
10

This Ghost Called shelf

By xrspook @ 9:52:26 Filed under: Lousy Diary

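I couldn't even understand what the exercise was asking for, so solving it was out of the question, but I gritted my teeth and did it anyway, in the way I understood it. I hadn't planned to look at the reference solution. I had gone to look at the solution to a different exercise, didn't understand it, and downloaded the solution to the previous exercise while I was at it. It turned out that the word I couldn't make sense of really was something the author assumed you knew and that I knew nothing about. The Chinese translation of shelf is easy enough: it's a cabinet. But what is the cabinet for? Is it just a word, a function, a dictionary, or something else? When I saw the filename of the reference solution, it started to dawn on me: it was probably a database. I took the word straight to my online friend, and he couldn't work out what it was either. He has never learned Python, though he has learned other programming languages. To me that showed the languages he knows don't have this thing, at least not under that name. The author of Think Python assumes we all know what a shelf is. The word had not appeared anywhere earlier in that chapter, which is Chapter 14, and the previous 13 chapters never said a word about what it means either. It's like talking about tuples to someone who has never learned Python: they would have no idea what you mean. In earlier exercises, when something like this came up, the author added a hint after the exercise saying what the thing was and giving a link where readers could learn about it. This exercise has no hint at all, so I seriously wonder whether the Chinese translators of Think Python understood the word themselves. If they did, they should at least have told readers that the task is to put the dictionary's mappings into a database, and that this database is not a database in the traditional sense. Something like that can't be explained in a few words. Even after reading the section on shelf in the Chinese edition of the Python manual, I still don't feel I understand what it is.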

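Following the reference solution, I first added a statement to my program that creates the database and then added the shelf handling. I don't know what happened next, because the cursor in the terminal just sat there as if the machine had frozen. After I closed the program, some database files had appeared in the script's folder. I don't know what they are, but there is clearly a lot in them. One is a dir file of over 100 KB, and another, a database cache file, is nearly 30 MB. I have no idea where all that content came from. Maybe I should change the extension and open it in Access to see what marvels are inside. Because the database is so large, the cursor appeared stuck in the terminal. Why does something a Python dictionary can display in an instant become so huge when built as a database? My guess is that a lookup a dictionary finishes instantly would take something like ten thousand times longer in the database. This reminds me of VBA in Excel: a script that reads and writes cells directly is very slow, but if you hold the data in an array and write it all out at the end, it is far more efficient, and a speedup of a few hundred times is on the low end.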
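For readers who hit the same wall, here is a minimal sketch of what a shelf looks like in code, assuming the exercise means Python's standard shelve module. The anagram data and the file name are made up for illustration, and this is not the book's reference solution.

```python
import os
import shelve
import tempfile

# Hypothetical anagram data: sorted letters -> words spelled with them.
anagrams = {
    "opst": ["opts", "post", "pots", "spot", "stop", "tops"],
    "aet": ["ate", "eat", "tea"],
}

# Keep the demo files in a temporary folder (an assumption for tidiness).
path = os.path.join(tempfile.mkdtemp(), "anagram_shelf")

# Writing: the shelf behaves like a dict, but each value is pickled to disk.
with shelve.open(path) as shelf:
    for key, words in anagrams.items():
        shelf[key] = words  # keys must be strings

# Reading: reopen the same files later and look a key up.
with shelve.open(path) as shelf:
    found = shelf.get("aet", [])

print(found)
```

A shelf is stored in dbm-style files rather than in a relational database, so there is no query language involved. On systems where Python falls back to its dbm.dumb backend, the files end in .dir and .dat, which would be consistent with the dir file mentioned above.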

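In high school I learned Access, but I only did whatever the teacher told us to, so I know just the basics. The essence of Access is the database, and the soul of a database is its query statements, but back then we never got past visual table operations.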

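Whichever programming language you master, you can get everything done with it. Some people study to make more money; I study hard simply because I want to know and I want to make things work.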

2020-06
5

Random Words Piling Up into Prose

By xrspook @ 14:47:54 Filed under: Playing IT

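Pick random words from a book and stitch them into sentences and paragraphs. The key is handling prefixes and suffixes properly: the words of a prefix must be looked up together as one unit, and each suffix must be tied to the prefix it follows.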

Exercise 8: Markov analysis: Write a program to read a text from a file and perform Markov analysis. The result should be a dictionary that maps from prefixes to a collection of possible suffixes. The collection might be a list, tuple, or dictionary; it is up to you to make an appropriate choice. You can test your program with prefix length two, but you should write the program in a way that makes it easy to try other lengths. Add a function to the previous program to generate random text based on the Markov analysis. Here is an example from Emma with prefix length 2: He was very clever, be it sweetness or be angry, ashamed or only amused, at such a stroke. She had never thought of Hannah till you were never meant for me?” “I cannot make speeches, Emma:” he soon cut it all himself. For this example, I left the punctuation attached to the words. The result is almost syntactically correct, but not quite. Semantically, it almost makes sense, but not quite. What happens if you increase the prefix length? Does the random text make more sense? Once your program is working, you might want to try a mash-up: if you combine text from two or more books, the random text you generate will blend the vocabulary and phrases from the sources in interesting ways. Credit: This case study is based on an example from Kernighan and Pike, The Practice of Programming, Addison-Wesley, 1999. You should attempt this exercise before you go on; then you can download my solution from http://thinkpython2.com/code/markov.py. You will also need http://thinkpython2.com/code/emma.txt.
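To see the prefix-to-suffix mapping on something smaller than Emma, here is a minimal sketch on a toy sentence. The function names and the sample text are assumptions for illustration. Unlike the program in this post, the sketch keeps repeated suffixes, so frequent continuations are picked more often.

```python
import random
from collections import defaultdict


def build_chain(words, order=2):
    """Map each tuple of `order` consecutive words to the words that follow it."""
    chain = defaultdict(list)
    for i in range(len(words) - order):
        prefix = tuple(words[i:i + order])
        chain[prefix].append(words[i + order])
    return chain


def generate(chain, length):
    """Start from a random prefix and append up to `length` random suffixes."""
    prefix = random.choice(list(chain))
    out = list(prefix)
    for _ in range(length):
        suffixes = chain.get(prefix)  # .get avoids creating empty entries
        if not suffixes:              # dead end: this prefix has no suffix
            break
        out.append(random.choice(suffixes))
        prefix = tuple(out[-len(prefix):])
    return " ".join(out)


words = "the cat sat on the cat bed".split()
chain = build_chain(words, order=2)
print(chain[("the", "cat")])   # the two words that follow "the cat"
print(generate(chain, 5))
```

Changing order is all it takes to try other prefix lengths, which is what the exercise asks the program to make easy.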

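import random
from collections import defaultdict

def set_book(fin1, num):
    """Map each prefix (a tuple of num words) to the list of words that follow it."""
    d = defaultdict(list)  # missing keys default to an empty list
    l = []
    for line in fin1:
        line = line.replace('-', ' ')
        for word in line.rstrip().split():  # split on whitespace and store every word in the list
            l.append(word)
    for i in range(len(l) - num):  # slide a window along the word list to build the dictionary
        header = tuple(l[i:i+num])  # the prefix tuple is the key
        if l[i+num] not in d[header]:
            d[header].append(l[i+num])  # the list of suffixes is the value
    return d

def next_word(start, book):  # renamed from next so the built-in is not shadowed
    return random.choice(book[start])

prefix_num = 3  # number of words in a prefix
suffix_num = 100  # number of suffixes to generate
with open('emma.txt', encoding='utf-8') as fin1:
    book = set_book(fin1, prefix_num)
start = random.choice(list(book.keys()))  # start from a random prefix
final = start
for i in range(suffix_num):  # take the last few words as the prefix and find a suffix
    prefix = final[-prefix_num:]
    if prefix not in book:  # the last words of the book may have no suffix
        break
    final += (next_word(prefix, book),)
for word in final:
    print(word, end=' ')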
# reigns alone. A very proper compliment! and then follows the application, 
# which I think, my dear, you said you had a great deal happier if she had no 
# intellectual superiority to make atonement to herself, or frighten those 
# who might hate her into outward respect. She had never seen her look so well, 
# so lovely, so engaging. There was consciousness, animation, and warmth; 
# there was every appearance of its being all in proof of how much he was 
# in love with, how to be able to return! I shall try what I can do. 
# Harriet's features are very delicate, which makes a likeness
2020-05
2

Wouldn't Changing the Dictionary Rule Be Nicer?

By xrspook @ 20:55:44 Filed under: Playing IT

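Changing what the dictionary stores as its values makes picking random words from a book easy, and I really don't get why the reference solution goes to so much trouble. In Chapter 13 of Think Python 2, the default rule for the dictionary is that the word is the key and its frequency is the value. Since this exercise needs a unique index for finding a random word, why not just make the value a unique serial number and be done with it? Then one zip swaps the dictionary's values and keys, and random.choice() lands directly on a random word. I only changed the rule for building the dictionary, and it took 0.12 seconds; the reference solution goes to far more trouble and took 0.42 seconds. The reference solution leaves the dictionary rule alone because it wants to instill Python's knack for assembling modules. Assembling is convenient, but as this shows, it is not necessarily the most efficient.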

This algorithm works, but it is not very efficient; each time you choose a random word, it rebuilds the list, which is as big as the original book. An obvious improvement is to build the list once and then make multiple selections, but the list is still big.

An alternative is: Use keys to get a list of the words in the book. Build a list that contains the cumulative sum of the word frequencies (see Exercise 2). The last item in this list is the total number of words in the book, n. Choose a random number from 1 to n. Use a bisection search (See Exercise 10) to find the index where the random number would be inserted in the cumulative sum. Use the index to find the corresponding word in the word list.

Exercise 7: Write a program that uses this algorithm to choose a random word from the book. Solution: http://thinkpython2.com/code/analyze_book3.py.
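The algorithm quoted above can be sketched as follows. This is a minimal illustration on a made-up three-word histogram, not the book's analyze_book3.py, and the function names are assumptions. One thing it makes visible: this method picks words in proportion to their frequency, whereas indexing distinct words by serial number, as this post does, gives every distinct word the same chance.

```python
import bisect
import random
from itertools import accumulate


def build_cumulative(hist):
    """Return the word list and the running total of their frequencies."""
    words = list(hist)
    cumulative = list(accumulate(hist[w] for w in words))
    return words, cumulative


def word_for(r, words, cumulative):
    """Find the word whose cumulative range contains r (1 <= r <= total)."""
    return words[bisect.bisect_left(cumulative, r)]


def random_word(words, cumulative):
    """Pick a word with probability proportional to its frequency."""
    r = random.randint(1, cumulative[-1])  # the last item is the total count n
    return word_for(r, words, cumulative)


hist = {"the": 5, "emma": 3, "harriet": 2}   # hypothetical word frequencies
words, cumulative = build_cumulative(hist)
print(cumulative)                    # running totals of 5, 3, 2
print(random_word(words, cumulative))
```

The cumulative list is built once, and each pick then costs one bisection search, which is the improvement the book is after.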

import string
import random
from time import time

def set_book(fin1):
    useless = string.punctuation + string.whitespace + '“' + '”'  # punctuation and newlines all get chopped off
    d = {}
    i = 0  # start at 0 so the serial numbers run from 0 to len(d) - 1
    for line in fin1:
        line = line.replace('-', ' ')  # every hyphenated word gets split in two; is that really a good idea?
        for word in line.split():
            word = word.strip(useless)
            word = word.lower()
            if word not in d:
                d[word] = i  # when a word enters the dictionary, its value is its serial number
                i += 1
            # d[word] = d.get(word, 0) + 1  # I'm not counting frequencies anyway, so this is unnecessary
    return d

fin1 = open('emma.txt', encoding='utf-8')
start = time()
book1 = set_book(fin1)
fin1.close()
book2 = dict(zip(book1.values(), book1.keys()))  # swap keys and values so the serial number becomes the unique index
print('100 random words in book')
for i in range(100):
    if i > 1 and i % 8 == 0:
        print()
    print(book2[random.randrange(len(book2))], end=' ')  # look a word up by a random serial number, as fast as it gets
print()
end = time()
print(end - start)
# 100 random words in book
# solicit laughing preserve inebriety elton's unimpeded effusions unselfish
# intimate connect native judges charities travel informs colours
# enigmas bragge case greensward cox's particularly unexampled promise
# prone greensward dignity maps fourth christmas creature maximum
# graver mildest pleasant corrected increased named partridge marks
# following kept gloom conjecturing parlour inheriting say consulting
# magnified abundant produces sons malt add unenforceability beautifully
# richly striking confuse greatness asleep steps humility upon
# already paper delight liberties confide appendages undecided male
# prophecies esteem unadorned likelihood shopping deeply unbiased horrors
# man's dumplings business chapter shakespeare sees counsels attentive
# silenced ventured singular double mean waltzes requisite checks
# unattended qualified blessed surmises
# 0.12100672721862793
2020-04
26

Counting the Words in a Book

By xrspook @ 18:12:57 Filed under: Playing IT

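Counting the words in a book ought to be a routine, simple job, but in practice all sorts of complications keep cropping up. Some you can anticipate: the typesetter used full-width punctuation, say, and the program strips punctuation by default, but what if the typesetter didn't use spaces consistently? Others you can't, such as slips of the hand that create bizarre strings. I noticed long ago that Notepad++ and Word give different character counts, and the Notepad++ figure is usually larger. Which one is right? Who knows. Roughly knowing is good enough; after all, in the college entrance exam nobody will really dock you for falling a few characters short of 800.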
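One way to cope with curly quotes and full-width punctuation is to test Unicode categories instead of listing characters by hand. The sketch below assumes that "punctuation" means any character whose Unicode category starts with P; the sample sentence and the function name are made up for illustration.

```python
import string
import unicodedata


def strip_punct(word):
    """Trim punctuation from both ends of a word, judged by Unicode category."""
    start, end = 0, len(word)
    while start < end and unicodedata.category(word[start]).startswith("P"):
        start += 1
    while end > start and unicodedata.category(word[end - 1]).startswith("P"):
        end -= 1
    return word[start:end]


sample = "“Hello,” she said。 ‘really?’"

# string.punctuation is ASCII only, so curly quotes and 。 block the strip.
ascii_only = [w.strip(string.punctuation) for w in sample.split()]

# Category-based stripping handles them, but leaves symbols such as $ or +
# alone, because those belong to category S rather than P.
unicode_aware = [strip_punct(w) for w in sample.split()]

print(ascii_only)
print(unicode_aware)
```

This does nothing about missing spaces between words, so the count can still drift for badly typeset text.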

I feel the love-hate relationship between dictionaries and lists more and more deeply.

words.txt is here, and emma.txt is here.

Exercise 1: Write a program that reads a file, breaks each line into words, strips whitespace and punctuation from the words, and converts them to lowercase. Hint: The string module provides a string named whitespace, which contains space, tab, newline, etc., and punctuation which contains the punctuation characters. Let’s see if we can make Python swear:
>>> import string
>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
Also, you might consider using the string methods strip, replace and translate.

Exercise 2: Go to Project Gutenberg (http://gutenberg.org) and download your favorite out-of-copyright book in plain text format. Modify your program from the previous exercise to read the book you downloaded, skip over the header information at the beginning of the file, and process the rest of the words as before. Then modify the program to count the total number of words in the book, and the number of times each word is used. Print the number of different words used in the book. Compare different books by different authors, written in different eras. Which author uses the most extensive vocabulary?

Exercise 3: Modify the program from the previous exercise to print the 20 most frequently used words in the book.

Exercise 4: Modify the previous program to read a word list (see Section 9.1) and then print all the words in the book that are not in the word list. How many of them are typos? How many of them are common words that should be in the word list, and how many of them are really obscure?
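Exercises 2 to 4 can also be approached with collections.Counter and a set difference. This is an alternative technique, not the dictionary-and-zip approach used in this post, and the tiny text and word list below are hypothetical stand-ins for emma.txt and words.txt.

```python
from collections import Counter

text = "the cat and the dog and the bird"
known_words = {"the", "cat", "and", "dog"}   # hypothetical stand-in for words.txt

counts = Counter(text.split())               # word -> number of occurrences
total = sum(counts.values())                 # total number of words
top = counts.most_common(2)                  # most frequent words first
unknown = sorted(set(counts) - known_words)  # words missing from the word list

print(total, len(counts))
print(top)
print(unknown)
```

For a real book, the splitting and punctuation stripping from Exercise 1 would still have to happen before the words reach the Counter.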

import string
fin = open('words.txt')
mydict = {}
for line in fin:
    word = line.strip()
    mydict[word] = ''
file = open('emma.txt', encoding = 'utf-8')
essay = file.read().lower()
essay = essay.replace('-', ' ')
pun = {}
str_all = '“' + '”' + string.punctuation
for x in str_all: # build a dictionary of all the punctuation characters
    pun[x] = ''
useless = essay.maketrans(pun) # two-argument maketrans needs strings of equal length; a dictionary neatly avoids that
l = essay.translate(useless).split() # words containing - come out mangled, but the pieces still count as words
print('this book has', len(l), 'words')
book = {}
for item in l: # file to string, string to word list, list to counting dictionary: word is the key, count is the value
    book[item] = book.get(item, 0) + 1
list_words1 = sorted(list(zip(book.values(), book.keys())), reverse = True) # dictionary to list, with key and value swapped
print('this book has', len(list_words1), 'different words')
print('times', 'word', sep='\t')
count = 1
word_len = 0 # minimum word length
for times, word in list_words1: # print the 20 most used words longer than word_len (with no limit, the simplest words of 3 letters or fewer flood the screen)
    if len(word) > word_len:
        print(times, word, sep='\t')
        count += 1
    if count > 20:
        break
count = 0
for word in book:
    if word not in mydict:
        # print(word, end=' ')
        count += 1
print(count, 'words in book not in dict') # the result is grim: 590 in total
# this book has 164065 words
# this book has 7479 different words
# times   word
# 5379    the
# 5322    to
# 4965    and
# 4412    of
# 3191    i
# 3187    a
# 2544    it
# 2483    her
# 2401    was
# 2365    she
# 2246    in
# 2172    not
# 2069    you
# 1995    be
# 1815    that
# 1813    he
# 1626    had
# 1448    as
# 1446    but
# 1373    for
# 590 words in book not in dict
# -----------------------------Solution 2----------------------------- really just a different way of splitting out the words
import string
def set_book(fin1):
    useless = string.punctuation + string.whitespace + '“' + '”'
    d = {}
    for line in fin1:
        line = line.replace('-', ' ')
        for word in line.split():
            word = word.strip(useless)
            word = word.lower()
            d[word] = d.get(word, 0) + 1
    return d
def set_dict(fin2):
    d = {}
    for line in fin2:
        word = line.strip()
        d[word] = d.get(word, 0) + 1
    return d
fin1 = open('emma.txt', encoding='utf-8')
fin2 = open('words.txt')
book = set_book(fin1)
mydict = set_dict(fin2)
l = sorted(list(zip(book.values(), book.keys())), reverse=True)
count = 0
for key in book:
    count = count + book[key]
print('this book has', count, 'words')
print('this book has', len(book), 'different words')
num = 20
print(num, 'most common words in this book')
print('times', 'word', sep='\t')
for times, word in l:
    print(times, word, sep='\t')
    num -= 1
    if num < 1:
        break
count = 0
for word in book:
    if word not in mydict:
        # print(word, end=' ')
        count += 1
# print()
print(count, 'words in book not in dict')
# this book has 164120 words
# this book has 7531 different words
# 20 most common words in this book
# times   word
# 5379    the
# 5322    to
# 4965    and
# 4412    of
# 3191    i
# 3187    a
# 2544    it
# 2483    her
# 2401    was
# 2364    she
# 2246    in
# 2172    not
# 2069    you
# 1995    be
# 1815    that
# 1813    he
# 1626    had
# 1448    as
# 1446    but
# 1373    for
# 683 words in book not in dict
© 2004 - 2021 我的天 | Theme by xrspook | Power by WordPress