2020-06-05

Random Words Clumped into Prose

By xrspook @ 14:47:54 Filed under: 扮IT

Pick words at random from a book and stitch them into sentences and paragraphs. The key is handling the prefixes and suffixes well: a prefix must be looked up as a bound unit, and each suffix must be tied to the prefix it follows.

Exercise 8: Markov analysis: Write a program to read a text from a file and perform Markov analysis. The result should be a dictionary that maps from prefixes to a collection of possible suffixes. The collection might be a list, tuple, or dictionary; it is up to you to make an appropriate choice. You can test your program with prefix length two, but you should write the program in a way that makes it easy to try other lengths.

Add a function to the previous program to generate random text based on the Markov analysis. Here is an example from Emma with prefix length 2: He was very clever, be it sweetness or be angry, ashamed or only amused, at such a stroke. She had never thought of Hannah till you were never meant for me?” “I cannot make speeches, Emma:” he soon cut it all himself.

For this example, I left the punctuation attached to the words. The result is almost syntactically correct, but not quite. Semantically, it almost makes sense, but not quite. What happens if you increase the prefix length? Does the random text make more sense?

Once your program is working, you might want to try a mash-up: if you combine text from two or more books, the random text you generate will blend the vocabulary and phrases from the sources in interesting ways.

Credit: This case study is based on an example from Kernighan and Pike, The Practice of Programming, Addison-Wesley, 1999. You should attempt this exercise before you go on; then you can download my solution from http://thinkpython2.com/code/markov.py. You will also need http://thinkpython2.com/code/emma.txt.
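Before the full solution below, here is a toy illustration of the data structure the exercise asks for, with prefix length two; this is my own sketch, not the book's code:

text = 'half a bee philosophically must ipso facto half not be'.split()
mapping = {}
for i in range(len(text) - 2):
    prefix = (text[i], text[i + 1])                      # two-word prefix as a tuple key
    mapping.setdefault(prefix, []).append(text[i + 2])   # list of possible suffixes as the value
print(mapping[('half', 'a')])   # ['bee']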

import random
from collections import defaultdict

def set_book(fin1, num):
    d = defaultdict(list)  # each prefix key defaults to a list of suffixes
    l = []
    for line in fin1:
        line = line.replace('-', ' ')
        for word in line.rstrip().split():  # split on whitespace and newlines, collect the words
            l.append(word)
    for i in range(len(l) - num):  # walk the list index by index to build the dictionary
        header = tuple(l[i:i + num])  # the prefix tuple is the key
        if l[i + num] not in d[header]:
            d[header].append(l[i + num])  # the suffix list is the value
    return d

def next_word(start, book):  # renamed from next() to avoid shadowing the built-in
    return random.choice(book[start])

fin1 = open('emma.txt', encoding='utf-8')
prefix_num = 3  # number of words in a prefix
suffix_num = 100  # number of suffix words to generate
book = set_book(fin1, prefix_num)
start = random.choice(list(book.keys()))  # random opening prefix
final = start
for i in range(suffix_num):  # take the last few words as the prefix to find the next suffix
    final += (next_word(final[-prefix_num:], book),)
for word in final:
    print(word, end=' ')
# reigns alone. A very proper compliment! and then follows the application, 
# which I think, my dear, you said you had a great deal happier if she had no 
# intellectual superiority to make atonement to herself, or frighten those 
# who might hate her into outward respect. She had never seen her look so well, 
# so lovely, so engaging. There was consciousness, animation, and warmth; 
# there was every appearance of its being all in proof of how much he was 
# in love with, how to be able to return! I shall try what I can do. 
# Harriet's features are very delicate, which makes a likeness
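The exercise also suggests trying a mash-up of two or more books. A minimal sketch of one way to blend a second source into the same dictionary; 'persuasion.txt' is a hypothetical second file, and set_book() is the function above:

book2 = set_book(open('persuasion.txt', encoding='utf-8'), prefix_num)
for prefix, suffixes in book2.items():
    for s in suffixes:
        if s not in book[prefix]:   # merge while keeping each suffix list duplicate-free
            book[prefix].append(s)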
2020-06-05

Setting Off

By xrspook @ 8:29:07 Filed under: 烂日记

I can no longer remember when I last wrote Python; it feels like ages ago, at least a month or more. I honestly can't recall the exact time, but I do still remember where I got stuck last time and where I should pick up again. Back then I had reached Chapter 14, but in fact I hadn't fully digested Chapter 13. I spent a fair amount of time on its earlier parts; the later parts I simply swallowed whole. The last exercise of Chapter 13 I knew I would never attempt, because I had no idea what the problem was even asking me to do; probably because my math is weak, I couldn't understand the question. But the second-to-last exercise I felt I could manage.

That exercise has you randomly pick words from a book and assemble them into sentences that might be interesting. Sentences pieced together at random are of course semantic nonsense, but if the words before and after each position stay relatively stable, the combinations will at least carry some meaning, even if the sentence as a whole is still absurd. As the cohesion of the surrounding words strengthens, the meaning of the whole sentence becomes clearer. This is essentially a mode of operation that finds a suffix from a prefix. By default the prefix is two words long: the two preceding words pick the next word, then the last two words are used to pick the word after that, and so on. In theory the method extends to using the previous N words to pick the next word, then dropping the first word and continuing. The idea isn't complicated, but what do you implement it with? That genuinely takes some thought, and Think Python doesn't hand you every technique. Before I finally wrote my own solution, I read their answer, but I don't think I understood it, because it brings in many things the book never mentioned, silently assuming plenty that they believe you must already know and therefore needs no explanation. If this were a traditional textbook, that would make life unbearable! Doing this book's exercises, I have complained countless times about how shamelessly they go beyond the material. But it is precisely these out-of-nowhere leaps that force you not only to read the book but to think, and to go search out solutions yourself, the things they assume you surely understand but never actually said.

In the end I wrote what I wanted; how far my result is from theirs, I haven't compared. Many people say Python is Lego-style programming, one module stacked on another. But recursion makes my head spin, so while the reference answer reached for global variables and recursion, I still chose loops, still printed everything from the main program, and nested several operations into a single line. I could of course pull the nested parts out into their own function, but I don't want several lines for what one line can say, even though a few more lines might make it more convenient to call later. I don't do that for now because the functionality I need is still simple; I managed it in one line, just with several parameters nested, which is exactly how Excel formulas work too, although at times I also curse those Excel formula chains that run tens of thousands of kilometers long.
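A minimal sketch of the sliding-window idea described above; it assumes a book dictionary mapping prefix tuples to suffix lists, as set_book() in the previous post builds, and is my own illustration rather than the book's solution:

import random

def generate(book, length):
    window = random.choice(list(book.keys()))   # start from a random prefix tuple
    out = list(window)
    for _ in range(length):
        suffixes = book.get(window)
        if not suffixes:                        # dead end: this prefix never gained a suffix
            break
        word = random.choice(suffixes)
        out.append(word)
        window = window[1:] + (word,)           # drop the first word, append the new one
    return ' '.join(out)

The window update is the whole trick: the tuple slice drops the oldest word, so the prefix length never changes.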

I had never imagined I could solve, within half a day, a problem I had thought about before without ever finding a way to crack.

2020-04-24

Two Days Chewing on One Problem

By xrspook @ 20:32:30 Filed under: 扮IT

Just before this I was congratulating myself that my script beat the reference answer on efficiency, but for this problem I couldn't tell what the reference answer was doing even while staring straight at it. Given a single word, I can certainly enumerate every possibility of removing one letter at a time, but how do you know which candidate at one level pairs with which at the level above??? I spent two days studying and digesting the answer: working out why it does what it does, while also looking for an easier-to-digest way to express it. The reason this problem tied me in knots is, at bottom, that I couldn't see what means I could use to realize it. Without a workable logic, there is no workable program.

Exercise 4: Here’s another Car Talk Puzzler (http://www.cartalk.com/content/puzzlers): What is the longest English word, that remains a valid English word, as you remove its letters one at a time? Now, letters can be removed from either end, or the middle, but you can’t rearrange any of the letters. Every time you drop a letter, you wind up with another English word. If you do that, you’re eventually going to wind up with one letter and that too is going to be an English word—one that’s found in the dictionary. I want to know what’s the longest word and how many letters does it have? I’m going to give you a little modest example: Sprite. Ok? You start off with sprite, you take a letter off, one from the interior of the word, take the r away, and we’re left with the word spite, then we take the e off the end, we’re left with spit, we take the s off, we’re left with pit, it, and I.

Write a program to find all words that can be reduced in this way, and then find the longest one.

This exercise is a little more challenging than most, so here are some suggestions: You might want to write a function that takes a word and computes a list of all the words that can be formed by removing one letter. These are the “children” of the word. Recursively, a word is reducible if any of its children are reducible. As a base case, you can consider the empty string reducible. The wordlist I provided, words.txt, doesn’t contain single letter words. So you might want to add “I”, “a”, and the empty string. To improve the performance of your program, you might want to memoize the words that are known to be reducible.

Solution: http://thinkpython2.com/code/reducible.py.

In the end I think I finally digested it, and along the way I drew a mind map to help people understand at what point the reduction counts as complete and what state counts as failure. [''] and [] are two different things!!!!!!

is_reducible() is the crucial function, and it is where the memos are used. Seeding the memo with known[''] = [''] is also crucial: it is a guard pattern, and without the guard is_reducible() simply cannot work. Of the five functions in this script, all but the initial dictionary-building one can be tested on their own; dropping a fixed word into a scaffold test helps understanding (a small scaffold sketch follows the script below). cut_letter(), is_reducible(), and all_reducible() all end up returning lists, and their shapes are all similar. I hope the comments from my own process of understanding help anyone who needs them. PS: the reference answer's printed output is dizzying; my modified version prints beautifully :)

from time import time

def set_dict(fin):
    d = {}
    for line in fin:
        word = line.strip()
        d[word] = 0
    for word in ['a', 'i', '']:
        d[word] = 0
    return d

def cut_letter(word, d): # generate the child words; returns a list
    l = []
    for i in range(len(word)):
        new_word = word[:i] + word[i+1:]
        if new_word in d:
            l.append(new_word)
    return l # [''] has length 1, [] has length 0; a word with no children returns [], while 'a' returns ['']

def is_reducible(word, d): # test whether a word keeps spawning children all the way down; returns a list
    if word in known: # guard pattern: '' is seeded into the memo up front; without it the recursion always bottoms out at [] and nothing comes back
        return known[word]
    # if word == '': # without the memo, this guard is needed instead
    #     return ['']
    l = []
    for new_word in cut_letter(word, d):
        if len(is_reducible(new_word, d)) > 0:
            l.append(new_word)
    known[word] = l
    return l

def all_reducible(d): # collect every word that reduces all the way down; returns a list
    l = []
    for word in d:
        if len(is_reducible(word, d)) > 0: # a non-empty list means the word is reducible
            l.append((len(word), word)) # a list of 2-tuples: (number of letters, the word itself)
    new_l = sorted(l, reverse = True) # one letter drops per step, so more letters means more reduction levels
    return new_l

def word_list(word): # print a word and its chain of child words
    if len(word) == 0: # the last word to reach is_reducible() is ''; when it arrives here, printing ends
        return
    print(word)
    l = is_reducible(word, d) # the word already passed the wordlist check, so it is guaranteed reducible
    word_list(l[0]) # when there are several children, take only the first

known = {} # the memo only matters inside is_reducible(); besides speed, it doubles as the guard
known[''] = [''] # is_reducible() returns a list, so even the empty string's value must be a list!
fin = open('words.txt')
start = time()
d = set_dict(fin) # a plain dictionary: keys are words, values are 0
words = all_reducible(d) # a list of 2-tuples
for i in range(5):
    word_list(words[i][1]) # the word part of the i-th tuple in the list
end = time()
print(end - start)
# complecting
# completing
# competing
# compting
# comping
# coping
# oping
# ping
# pig
# pi
# i
# twitchiest
# witchiest
# withiest
# withies
# withes
# wites
# wits
# its
# is
# i
# stranglers
# strangers
# stranger
# strange
# strang
# stang
# tang
# tag
# ta
# a
# staunchest
# stanchest
# stanches
# stances
# stanes
# sanes
# anes
# ane
# ae
# a
# restarting
# restating
# estating
# stating
# sating
# sting
# ting
# tin
# in
# i
# 0.6459996700286865
# without memos 1.5830001831054688, with memos 0.6459996700286865
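As mentioned above, everything except set_dict() can be tested in isolation by dropping in a fixed word. A minimal scaffold sketch, assuming d and known have already been built as in the script above, and that 'sprite' (the word from the puzzle statement) and its chain survive in the wordlist:

print(cut_letter('sprite', d))    # the children of 'sprite' that appear in the wordlist
print(is_reducible('sprite', d))  # a non-empty list means 'sprite' reduces all the way down
word_list('sprite')               # print one complete reduction chain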
2020-04-22

Dictionary to Tuple

By xrspook @ 13:15:37 Filed under: 扮IT

Doing something that feels simple without getting fancy and without out-of-syllabus methods turns out not to be simple. After working through this exercise, I think I finally understand the love-hate entanglement of lists, dictionaries, and tuples. Thanks to the reference answer I found overly complicated, which forced me to hammer out a version of my own.

Happy! Exercise 1 already gets to use zip, the big gun this chapter only mentions at the very end. Swapping a dictionary's keys and values becomes this simple; my mind has been opened wide again.

Exercise 1: Write a function called most_frequent that takes a string and prints the letters in decreasing order of frequency. Find text samples from several different languages and see how letter frequency varies between languages. Compare your results with the tables at http://en.wikipedia.org/wiki/Letter_frequencies. Solution: http://thinkpython2.com/code/most_frequent.py.

def most_frequent(sth):
    d = {}
    for letter in sth: # map each character of the string to its frequency
        d[letter.lower()] = d.get(letter.lower(), 0) + 1 # fold uppercase down to lowercase
    t = tuple(zip(d.values(), d.keys())) # zip swaps the dict's keys and values into tuples (a list works too, but building a new dict would make you cry)
    return sorted(t, reverse = True) # sort descending by the first element
sth = 'This chapter presents one more built-in type, the tuple, and then shows how lists, dictionaries, and tuples work together. I also present a useful feature for variable-length argument lists, the gather and scatter operators.'
t = most_frequent(sth)
for item in t:
    print(item) # print the raw tuples: a bare space would be invisible without quotes around it
# (33, ' ')
# (25, 'e')
# (23, 't')
# (16, 's')
# (15, 'r')
# (14, 'a')
# (11, 'o')
# (11, 'n')
# (10, 'i')
# (10, 'h')
# (9, 'l')
# (7, 'u')
# (7, 'p')
# (5, ',')
# (4, 'g')
# (4, 'd')
# (3, 'w')
# (3, 'f')
# (3, 'c')
# (2, 'm')
# (2, 'b')
# (2, '.')
# (2, '-')
# (1, 'y')
# (1, 'v')
# (1, 'k')
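For comparison, the standard library can do the same counting in two lines; a minimal alternative sketch using collections.Counter (not the book's approach, and tied letters may print in a different order):

from collections import Counter

def most_frequent_counter(sth):
    counts = Counter(c.lower() for c in sth)           # character -> frequency
    return [(n, c) for c, n in counts.most_common()]   # most_common() is already sorted descending

for item in most_frequent_counter(sth):
    print(item)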
2020-04-22

Isn't This Written a Bit Too Simply?

By xrspook @ 9:29:12 Filed under: 烂日记

In the Chinese edition of Think Python 2 that I'm reading, you could almost say that not a single cross-reference in the Chapter 11 exercises is correct. Some of the exercises differ from the current English edition; I suspect the English edition corrected certain things, and after the fixes it published no update notes, but even if it had, the Chinese translator probably wouldn't keep watching the English edition for changes after finishing the translation. Still, how can cross-references between exercises be done by feel? The first time I hit one, I was completely lost: following the Chinese edition's pointer, the thing simply wasn't in that chapter's exercises, yet I clearly remembered having done those exercises; which chapter were they in? Another Chinese edition got those references right. When I went back to the English original, I found that its references to exercises really are written extremely tersely, but every one of them is hyperlinked, anchored to the exact spot in the exact chapter. So is the Chinese edition I'm reading the way it is because the translator only translated the words and never looked into the hyperlinks? Or perhaps the translation was made not from the English electronic edition but from an English paper copy or a PDF without working links. The most comfortable thing about an electronic edition is precisely this: even if you never spell out which chapter an exercise is in and only say "Exercise N", everything is fine as long as the hyperlink is in place.

The Chinese edition I'm reading leaves me speechless, but the English original is hardly better, because the author's imagination runs so wild that words fail. I think with programming you have to give a range and then a reference result so people have something to aim for. But he only tells you to test it like so and then hands you a reference script. Before reading that script, if I just run my own, and the self-chosen parameters differ from his from the start, how is a reader to judge whether their own script is correct? Some parameters could perfectly well be given up front. You might say that amounts to a hint, but this isn't an exam; what matters is not the result but how you arrive at it, so even if you are shown the final result, you still can't reproduce it without understanding the process.

The author writes those parameter settings inside his own script, and when writing your own you know perfectly well you must set those parameters too, but set them to what? No way to tell. This leads into a strange loop: the reader reads the problem, solves it, tests it and feels fine, runs the reference answer, finds its result differs from their own, then has to dig through the reference answer for the parameters it used, plug those into their own script, run it again, and compare the two runs. Isn't that a huge hassle? Without the hidden parameters, the reader could simply run their own script once, run the reference answer once, and compare results; isn't the result what we're ultimately after? Of course, even when the results match exactly, it's still worth studying how the reference answer differs from your own script and whether the differences could cause problems.

It isn't only the exercise parameters in Think Python 2 that leave people lost. Some chapters are also written too thinly, for example the one on tuples. Tuples reportedly don't exist in some other programming languages. Before introducing tuples, the book has already covered strings, lists, and dictionaries. Those three nest one inside another and are fairly easy to grasp, but once tuples join in, the world turns murky, because lists, dictionaries, and tuples can be combined with and converted into one another; some are mutable, some are not; some can be sorted, some cannot. To truly explain tuples, what Chapter 12 says and the examples it gives are not enough. If they won't cover everything, they should at least give us a list of the operations that work on tuples; for the specifics, we can consult the reference.
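Since the chapter leaves the conversions vague, here is a tiny illustration of the mutability and conversion rules complained about above; my own sketch:

t = ('a', 'b', 'c')                 # tuple: immutable and hashable, so it can be a dict key
l = list(t)                         # tuple -> list: mutable, sortable in place
d = dict(zip(t, range(3)))          # zip two sequences into a dict: {'a': 0, 'b': 1, 'c': 2}
back = tuple(sorted(d.items()))     # dict -> tuple of (key, value) pairs, sorted by key
l[0] = 'z'                          # fine: lists are mutable
# t[0] = 'z'                        # TypeError: tuples are immutable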

Perhaps by the time I've finished the whole book, my views will no longer be what they are now.
