2022-05-22

"Winning" the lottery again

By xrspook @ 11:32:39 Filed under: 烂日记

On Saturday morning I saw an announcement from the Dongguan CDC: on May 20, Dongguan recorded another local COVID infection, a person returning to Dongguan from another province. The route: a high-speed train from out of province to Guangzhou South, then centralized transfer to a quarantine facility in Dongguan. The first two nucleic acid tests were negative; the third came back positive. Because the whole journey was under closed-loop management, with no contact with the community, Humen in Dongguan ended up with no lockdown zones or anything of the sort. This was only an asymptomatic infection, though it may very well be reclassified as a confirmed case. Either way, with a positive nucleic acid result, Dongguan once again counts as having a local outbreak. Under Guangzhou's control measures, that theoretically means another round of 3+11 management. As I've said before, in Guangdong cities of over ten million people with extremely high mobility, like Guangzhou, Shenzhen, and Dongguan, it is bound to happen regularly that outsiders, whether travelers under full closed-loop management or truck drivers, test positive and end up classified as confirmed or asymptomatic cases. That is why the Dongguan CDC announcement closed by saying that, thanks to the full closed-loop management, the risk of community transmission was low. In other words, this positive case is not a local outbreak of unknown origin, and there has been essentially no community spread. Other places could stand to weigh that before applying their own policies to Dongguan. But I suspect Guangzhou only ever treats Dongguan in black-and-white terms: as long as there is a local positive, never mind the cause or the actual transmission risk.

So for me, coming back to Guangzhou this weekend meant once again staying home all day, going nowhere except to get nucleic acid tests, even though the "3" in that 3+11 is only home health monitoring.

The reporting requirement for people arriving in or returning to Guangzhou is a strange setup. One evening I couldn't stand it any longer and left a message on the Guangzhou Health Commission's WeChat official account, asking whether the daily outbreak bulletin could spell out which prefecture-level cities had had local cases within the past 14 days, i.e. exactly which places trigger the 3 days of home health monitoring plus 11 days of self health monitoring. Apart from the CDC's internal system and the system linking neighborhood committees to the CDC, ordinary residents have no way to look that list up. If I can't look it up, how am I supposed to know whether I need to report? If a place has positive cases every day, that's easy enough to work out. But if there are only sporadic ones, the occasional asymptomatic case or two, and those are people under closed-loop management, who is going to remember whether a given place did or didn't have cases in the 14 days before their trip to Guangzhou? So I find it pointless that the Guangzhou Health Commission's account, or the 12320 account, just reposts Documents No. 29, 30, and 31 every day. I know what they say; what I don't know is what they mean in practice when enforced. It's like a scientific paper that uses footnotes, except when you turn to the references you find a citation you cannot track down no matter what. Reposting the exact same thing every day with only the date changed, something that cannot actually be acted on, is purely going through the motions for the bosses. By contrast, the Dongguan CDC's WeChat account spells out clearly every day, starting from which date, arrivals from which areas are subject to which kind of health management. The list is updated daily, so opening that page may give you a feeling of dread every time, but at least you can see plainly whether you've "won".

If I lived in only one city, I would have no choice but to accept whatever rules that city set. But living across more than one city, comparisons arise naturally. "Among any three people walking together, there is one who can be my teacher," as the saying goes; this is probably a case of that. The Haizhu Fabu official account posts management measures for medium- and high-risk areas, with detailed times and places, but the problem is that those are only the medium- and high-risk areas, while Document No. 29 speaks of prefecture-level cities with local cases within 14 days. So Haizhu district has indeed gone a step further, yet it still falls short of what I want. If the list is too long, the Guangzhou Health Commission could perfectly well build a lookup mini program. Why can't things that could clearly be made open and transparent be made plain to ordinary people?

The days of living on edge continue.

2020-06-05

Random words piled into prose

By xrspook @ 14:47:54 Filed under: 扮IT

Pick words at random from a book and stitch them into sentences and paragraphs. The key is handling prefixes and suffixes well: a prefix must be looked up as one bound unit, and each suffix must stay associated with its prefix.

Exercise 8: Markov analysis: Write a program to read a text from a file and perform Markov analysis. The result should be a dictionary that maps from prefixes to a collection of possible suffixes. The collection might be a list, tuple, or dictionary; it is up to you to make an appropriate choice. You can test your program with prefix length two, but you should write the program in a way that makes it easy to try other lengths. Add a function to the previous program to generate random text based on the Markov analysis. Here is an example from Emma with prefix length 2: He was very clever, be it sweetness or be angry, ashamed or only amused, at such a stroke. She had never thought of Hannah till you were never meant for me?” “I cannot make speeches, Emma:” he soon cut it all himself. For this example, I left the punctuation attached to the words. The result is almost syntactically correct, but not quite. Semantically, it almost makes sense, but not quite. What happens if you increase the prefix length? Does the random text make more sense? Once your program is working, you might want to try a mash-up: if you combine text from two or more books, the random text you generate will blend the vocabulary and phrases from the sources in interesting ways. Credit: This case study is based on an example from Kernighan and Pike, The Practice of Programming, Addison-Wesley, 1999. You should attempt this exercise before you go on; then you can download my solution from http://thinkpython2.com/code/markov.py. You will also need http://thinkpython2.com/code/emma.txt.
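
To make the prefix-to-suffix mapping concrete before the code, here is a toy illustration of my own (not from the book): with a prefix length of 2, the text "half a bee is half a bee" yields

d = {('half', 'a'): ['bee'],   # tuple prefixes are hashable, so they work as dict keys
     ('a', 'bee'): ['is'],
     ('bee', 'is'): ['half'],
     ('is', 'half'): ['a']}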

import random
from collections import defaultdict
def set_book(fin1, num):
    d = defaultdict(list) # the default value for each key is a list
    l = []
    for line in fin1:
        line = line.replace('-', ' ')
        for word in line.rstrip().split(): # split on whitespace and newlines, collect the words
            l.append(word)
    for i in range(len(l)-num): # build the dictionary by sliding a window along the list
        header = tuple(l[i:i+num]) # the num-word prefix tuple is the key
        if l[i+num] not in d[header]:
            d[header].append(l[i+num]) # the list of following words is the value
    return d
def next_word(start, book): # named next_word so the builtin next() isn't shadowed
    return random.choice(book[start]) # can fail on the book's very last prefix; rare, not guarded here
fin1 = open('emma.txt', encoding='utf-8')
prefix_num = 3 # prefix length in words
suffix_num = 100 # number of suffixes to generate
book = set_book(fin1, prefix_num)
start = random.choice(list(book.keys())) # start from a random prefix
final = start
for i in range(suffix_num): # the last prefix_num words generated so far form the next prefix
    final += (next_word(final[len(final)-prefix_num:], book),)
for word in final:
    print(word, end=' ')
# reigns alone. A very proper compliment! and then follows the application, 
# which I think, my dear, you said you had a great deal happier if she had no 
# intellectual superiority to make atonement to herself, or frighten those 
# who might hate her into outward respect. She had never seen her look so well, 
# so lovely, so engaging. There was consciousness, animation, and warmth; 
# there was every appearance of its being all in proof of how much he was 
# in love with, how to be able to return! I shall try what I can do. 
# Harriet's features are very delicate, which makes a likeness
2020-06-05

Hitting the road

By xrspook @ 8:29:07 Filed under: 烂日记

I can no longer remember when I last wrote Python; it feels like ages ago, at least a month or more. I genuinely can't pin down the date, but I still remember where I got stuck and where I should pick things up again. I had reached chapter 14, but in truth I hadn't fully digested chapter 13. I spent more time on the earlier parts; the later parts I basically swallowed whole. The last exercise of chapter 13 I'm quite sure I'll never attempt, because I simply don't understand what the question is asking, probably because my math isn't good enough to parse it. The second-to-last exercise, though, felt doable.

That exercise is about randomly choosing words from a book to form sentences that might carry some meaning. Sentences stitched together at random are of course nonsense, but if each word's neighbors stay relatively stable, the word combinations will at least mean something, however absurd the sentence as a whole. As the cohesion between preceding and following words strengthens, the sentence's meaning becomes clearer. It is essentially a mode of operation that finds suffixes from prefixes. By default the prefix is two words: use two words to find the next word, then use the last two words to find the word after that, and so on. In principle this extends to using the previous N words to find the next word, then dropping the first word and continuing. The idea isn't complicated, but what do you implement it with? That takes real thought, and Think Python doesn't hand you all the methods. Before I finished my own solution I looked at their answer, and I felt I didn't understand it, because it brought in a lot of things the book had never mentioned, silently assuming knowledge it considered too obvious to explain. In a conventional textbook that would make life impossible! Doing this book's exercises I have complained countless times that they go beyond the syllabus without restraint. But it's precisely these out-of-nowhere leaps that force you not just to read the book but to think, and to go search out solutions yourself, the things they assume you must know but never actually said. In the end I wrote what I wanted; how far my result differs from theirs, I haven't compared. Many people say Python is like Lego, one module stacked on another. But recursion makes my head spin, so where the reference answer used global functions and recursion, I still chose loops, still printed the results in the main body, and nested several operations into a single expression. I could pull the nested parts out into a separate function, but I don't want to spend several lines on what one line can say, even if a few more lines might make it more convenient to call later. The reason I don't do that now is that the functionality I need is still simple. I did it in one expression, just with several parameters nested, which is exactly how Excel formulas work too, though at times I also curse those Excel formulas that run on for tens of thousands of kilometers.
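
A minimal sketch of that sliding step, with made-up words of my own, is just tuple surgery:

prefix = ('could', 'not')        # the current two-word window
suffix = 'be'                    # one word drawn from the suffix list of that prefix
prefix = prefix[1:] + (suffix,)  # drop the first word, append the new one: ('not', 'be')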

I never imagined I could solve, within half a day, a problem I had once thought about without finding a way through.

2020-04-26

Counting how many words are in a book

By xrspook @ 18:12:57 Filed under: 扮IT

Counting the words in a book ought to be dead simple, but in practice all kinds of situations crop up. Some you anticipate, like typesetting with full-width punctuation: the program deletes punctuation by default, but what if whoever did the typesetting didn't use spaces properly? Some you would never expect, like typos creating bizarre strings. I noticed long ago that Notepad++ and Word disagree on word counts, and Notepad++ usually comes out higher. Who's right, who's wrong, let fate decide; knowing roughly is enough. After all, in the gaokao essay, coming in a few words short of 800 won't actually cost you points.

The love-hate entanglement of dictionaries and lists is something I appreciate more and more deeply.

words.txt is here, and emma.txt is here.

Exercise 1: Write a program that reads a file, breaks each line into words, strips whitespace and punctuation from the words, and converts them to lowercase. Hint: The string module provides a string named whitespace, which contains space, tab, newline, etc., and punctuation which contains the punctuation characters. Let’s see if we can make Python swear:
>>> import string
>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
Also, you might consider using the string methods strip, replace and translate.
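
As a minimal sketch of the hinted approach (my own example sentence, nothing from the book), the three-argument form of str.maketrans deletes characters:

import string
line = "Let's see if we can make Python swear!"
table = str.maketrans('', '', string.punctuation)  # third argument: characters to delete
print(line.translate(table).lower().split())
# ['lets', 'see', 'if', 'we', 'can', 'make', 'python', 'swear']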

Exercise 2: Go to Project Gutenberg (http://gutenberg.org) and download your favorite out-of-copyright book in plain text format. Modify your program from the previous exercise to read the book you downloaded, skip over the header information at the beginning of the file, and process the rest of the words as before. Then modify the program to count the total number of words in the book, and the number of times each word is used. Print the number of different words used in the book. Compare different books by different authors, written in different eras. Which author uses the most extensive vocabulary?

Exercise 3: Modify the program from the previous exercise to print the 20 most frequently used words in the book.

Exercise 4: Modify the previous program to read a word list (see Section 9.1) and then print all the words in the book that are not in the word list. How many of them are typos? How many of them are common words that should be in the word list, and how many of them are really obscure?

import string
fin = open('words.txt')
mydict = {}
for line in fin:
    word = line.strip()
    mydict[word] = ''
file = open('emma.txt', encoding = 'utf-8')
essay = file.read().lower()
essay = essay.replace('-', ' ')
pun = {}
str_all = '“' + '”' + string.punctuation
for x in str_all: # build a dict mapping every punctuation character to ''
    pun[x] = ''
useless = essay.maketrans(pun) # two-string maketrans requires equal lengths; a dict of chars to '' sidesteps that
l = essay.translate(useless).split() # hyphenated words get split apart here, but the pieces still count as words
print('this book has', len(l), 'words')
book = {}
for item in l: # file -> string -> word list -> counting dict: word as key, count as value
    book[item] = book.get(item, 0) + 1
list_words1 = sorted(list(zip(book.values(), book.keys())), reverse = True) # dict to list of (count, word) pairs, keys and values swapped for sorting
print('this book has', len(list_words1), 'different words')
print('times', 'word', sep='\t')
count = 1
word_len = 0 # minimum word length to show
for times, word in list_words1: # print the 20 most used words longer than word_len (unfiltered, trivial words of 3 letters or fewer flood the list)
    if len(word) > word_len:
        print(times, word, sep='\t')
        count += 1
    if count > 20:
        break
count = 0
for word in book:
    if word not in mydict:
        # print(word, end=' ')
        count += 1
print(count, 'words in book not in dict') # a grim result: 590 in total
# this book has 164065 words
# this book has 7479 different words
# times   word
# 5379    the
# 5322    to
# 4965    and
# 4412    of
# 3191    i
# 3187    a
# 2544    it
# 2483    her
# 2401    was
# 2365    she
# 2246    in
# 2172    not
# 2069    you
# 1995    be
# 1815    that
# 1813    he
# 1626    had
# 1448    as
# 1446    but
# 1373    for
# 590 words in book not in dict
# ----------------------------- Solution 2 ----------------------------- the only real difference is how words are split
import string
def set_book(fin1):
    useless = string.punctuation + string.whitespace + '“' + '”'
    d = {}
    for line in fin1:
        line = line.replace('-', ' ')
        for word in line.split():
            word = word.strip(useless)
            word = word.lower()
            d[word] = d.get(word, 0) + 1
    return d
def set_dict(fin2):
    d = {}
    for line in fin2:
        word = line.strip()
        d[word] = d.get(word, 0) + 1
    return d
fin1 = open('emma.txt', encoding='utf-8')
fin2 = open('words.txt')
book = set_book(fin1)
mydict = set_dict(fin2)
l = sorted(list(zip(book.values(), book.keys())), reverse=True)
count = 0
for key in book:
    count = count + book[key]
print('this book has', count, 'words')
print('this book has', len(book), 'different words')
num = 20
print(num, 'most common words in this book')
print('times', 'word', sep='\t')
for times, word in l:
    print(times, word, sep='\t')
    num -= 1
    if num < 1:
        break
count = 0
for word in book:
    if word not in mydict:
        # print(word, end=' ')
        count += 1
# print()
print(count, 'words in book not in dict')
# this book has 164120 words
# this book has 7531 different words
# 20 most common words in this book
# times   word
# 5379    the
# 5322    to
# 4965    and
# 4412    of
# 3191    i
# 3187    a
# 2544    it
# 2483    her
# 2401    was
# 2364    she
# 2246    in
# 2172    not
# 2069    you
# 1995    be
# 1815    that
# 1813    he
# 1626    had
# 1448    as
# 1446    but
# 1373    for
# 683 words in book not in dict
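
The gap between the two runs (164065 vs 164120 words, 590 vs 683 words missing from the dictionary) presumably comes down mostly to where punctuation is removed; a quick check of my own:

import string
table = str.maketrans('', '', string.punctuation)
print("can't".translate(table))           # cant  -- solution 1 deletes punctuation anywhere in a token
print("can't".strip(string.punctuation))  # can't -- solution 2 trims it only from the ends

Solution 1 turns can't into cant, which then matches the word list, while solution 2 keeps the apostrophe, so the same token is counted as a different word and fails the dictionary lookup.
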
2020-04-24

Two days chewing on one exercise

By xrspook @ 20:32:30 Filed under: 扮IT

I was just congratulating myself that my script outran the reference answer in efficiency, and then along comes an exercise where even staring at the reference answer I had no idea what they were talking about. For a single word, I can certainly enumerate every possibility of removing one letter at a time, but how do you know which option at the previous level pairs with which at this level??? I spent two days studying and digesting the answer: working out why it does what it does, while also looking for an alternative expression that is easier to absorb. What made this exercise so agonizing is, at root, that I couldn't see what means I could use to realize it. Without a workable logic, there is no workable program.

Exercise 4: Here’s another Car Talk Puzzler (http://www.cartalk.com/content/puzzlers): What is the longest English word, that remains a valid English word, as you remove its letters one at a time? Now, letters can be removed from either end, or the middle, but you can’t rearrange any of the letters. Every time you drop a letter, you wind up with another English word. If you do that, you’re eventually going to wind up with one letter and that too is going to be an English word—one that’s found in the dictionary. I want to know what’s the longest word and how many letters does it have? I’m going to give you a little modest example: Sprite. Ok? You start off with sprite, you take a letter off, one from the interior of the word, take the r away, and we’re left with the word spite, then we take the e off the end, we’re left with spit, we take the s off, we’re left with pit, it, and I. Write a program to find all words that can be reduced in this way, and then find the longest one. This exercise is a little more challenging than most, so here are some suggestions: You might want to write a function that takes a word and computes a list of all the words that can be formed by removing one letter. These are the “children” of the word. Recursively, a word is reducible if any of its children are reducible. As a base case, you can consider the empty string reducible. The wordlist I provided, words.txt, doesn’t contain single letter words. So you might want to add “I”, “a”, and the empty string. To improve the performance of your program, you might want to memoize the words that are known to be reducible. Solution: http://thinkpython2.com/code/reducible.py.

In the end I think I have finally digested it, and along the way I drew a mind map to help people see at what point a decomposition counts as complete and what state counts as failure. [''] and [] are two different things!!!!!!
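
In code terms, that distinction is just list length (my own minimal check):

print(len(['']))  # 1 -> the chain reached the empty string: reduction succeeded
print(len([]))    # 0 -> no valid children at all: reduction failed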

is_reducible() is the most critical function. The memos are used there, and seeding them with known[''] = [''] matters just as much: that is a guard pattern, and without the guard is_reducible() simply cannot work. Of the five functions in this script, all but the initial dictionary-building one can be tested on their own; dropping a fixed word into a test scaffold helps understanding (see the scaffold sketch after the code below). cut_letter(), is_reducible(), and all_reducible() all return lists in the end, and their shapes are all alike. I hope the comments from my own process of understanding help anyone who needs them. PS: the reference answer's printout is dizzying to read; my modified version prints beautifully :)

from time import time
def set_dict(fin):
    d = {}
    for line in fin:
        word = line.strip()
        d[word] = 0
    for word in ['a', 'i', '']:
        d[word] = 0
    return d
def cut_letter(word, d): # generate the word's children, return a list
    l = []
    for i in range(len(word)):
        new_word = word[:i] + word[i+1:]
        if new_word in d:
            l.append(new_word)
    return l # [''] has length 1, [] has length 0; a word with no valid children returns [], while 'a' returns ['']
def is_reducible(word, d): # can the word reduce all the way down? returns a list
    if word in known: # guard pattern: '' is seeded into the memo; without it every recursion bottoms out at [] and nothing succeeds
        return known[word]
    # if word == '': # without memos, this guard would be needed instead
    #     return ['']
    l = []
    for new_word in cut_letter(word, d):
        if len(is_reducible(new_word, d)) > 0:
            l.append(new_word)
    known[word] = l
    return l
def all_reducible(d): # collect every word that reduces all the way down, return a list
    l = []
    for word in d:
        if len(is_reducible(word, d)) > 0: # a non-empty list means the word is reducible
            l.append((len(word), word)) # tuples of two elements: letter count first, then the word itself
    new_l = sorted(l, reverse = True) # one letter goes per step, so more letters means a longer chain of reductions
    return new_l
def word_list(word): # print the word and its chain of child words
    if len(word) == 0: # the chain bottoms out at '' (length 0), so printing stops here
        return
    print(word)
    l = is_reducible(word, d) # this word was already verified reducible, so the list is guaranteed non-empty
    word_list(l[0]) # when there are several children, just follow the first
known = {} # the memo really only matters inside is_reducible(); besides speed, it doubles as the guard
known[''] = [''] # is_reducible() returns a list, so even for the empty string the value must be a list!
fin = open('words.txt')
start = time()
d = set_dict(fin) # a plain dict: word as key, 0 as value
words = all_reducible(d) # a list of 2-element tuples
for i in range(5):
    word_list(words[i][1]) # the second element of the i-th tuple, i.e. the word itself
end = time()
print(end - start)
# complecting
# completing
# competing
# compting
# comping
# coping
# oping
# ping
# pig
# pi
# i
# twitchiest
# witchiest
# withiest
# withies
# withes
# wites
# wits
# its
# is
# i
# stranglers
# strangers
# stranger
# strange
# strang
# stang
# tang
# tag
# ta
# a
# staunchest
# stanchest
# stanches
# stances
# stanes
# sanes
# anes
# ane
# ae
# a
# restarting
# restating
# estating
# stating
# sating
# sting
# ting
# tin
# in
# i
# 0.6459996700286865
# without memos 1.5830001831054688, with memos 0.6459996700286865
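
As promised above, a scaffold test of a single function might look like this, run after the script has built d and known (and assuming 'sprite' is in words.txt):

print(cut_letter('sprite', d))    # the children of 'sprite' that appear in the word list
print(is_reducible('sprite', d))  # a non-empty list means 'sprite' is reducible
word_list('sprite')               # prints one full reduction chain, first child each time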