2020-04-26

Counting how many words are in a book

By xrspook @ 18:12:57 Filed under: 扮IT

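Counting the words in a book sounds like a routine, simple job, but in practice all sorts of complications keep cropping up. Some you can anticipate: the typesetter may have used full-width punctuation, and while the program deletes punctuation by default, what happens if the typesetter was sloppy about spaces? Others you would never anticipate, such as bizarre strings created by a slip of the hand. I noticed long ago that Notepad++ and Word report different word counts, and that the Notepad++ figure is usually somewhat larger. As for which one is right, never mind; a rough figure is good enough. After all, nobody really docks marks from your gaokao essay for falling a few characters short of 800.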
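To make the full-width punctuation problem concrete, here is a minimal sketch. The sample sentence, the `FULLWIDTH` string, and both function names are made up for illustration; they are not taken from the programs further down. It compares deleting ASCII punctuation only with turning every punctuation mark, ASCII or full-width, into a space before splitting.

```python
import string

# Hypothetical sample: full-width comma and exclamation mark, no spaces around them.
sample = "well,yes!I think so."

# Assumed set of full-width marks; extend it to match the actual text.
FULLWIDTH = ",。!?;:“”‘’"

def count_delete_ascii(text):
    """Delete ASCII punctuation only, then split on whitespace."""
    table = str.maketrans('', '', string.punctuation)
    return len(text.translate(table).split())

def count_space_all(text):
    """Turn ASCII and full-width punctuation into spaces, then split."""
    marks = string.punctuation + FULLWIDTH
    table = str.maketrans(marks, ' ' * len(marks))
    return len(text.translate(table).split())

print(count_delete_ascii(sample))  # 3: "well,yes!I" stays glued together
print(count_space_all(sample))     # 5
```

Neither count is "the" right one. The point is only that two reasonable tokenizers can disagree on the same text, which is one possible reason why two editors report different totals.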

My appreciation of the love-hate relationship between dictionaries and lists keeps getting deeper.

words.txt is here, and emma.txt is here.

Exercise 1: Write a program that reads a file, breaks each line into words, strips whitespace and punctuation from the words, and converts them to lowercase. Hint: The string module provides a string named whitespace, which contains space, tab, newline, etc., and punctuation which contains the punctuation characters. Let’s see if we can make Python swear:
>>> import string
>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
Also, you might consider using the string methods strip, replace and translate.
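As a quick illustration of the three string methods named in the hint, here is a small sketch. The sample strings and variable names are invented for the demonstration.

```python
import string

# strip: removes the given characters from both ends only
edges = string.whitespace + string.punctuation
stripped = '"Hello!"\n'.strip(edges).lower()

# replace: swaps one substring for another everywhere it occurs
replaced = 'well-known'.replace('-', ' ')

# translate: maps or deletes individual characters anywhere in the string
delete_punct = str.maketrans('', '', string.punctuation)
translated = "don't, stop!".translate(delete_punct)

print(repr(stripped))    # 'hello'
print(repr(replaced))    # 'well known'
print(repr(translated))  # 'dont stop'
```

Note that `strip` keeps the apostrophe inside a word such as `don't`, while `translate` removes it.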

Exercise 2: Go to Project Gutenberg (http://gutenberg.org) and download your favorite out-of-copyright book in plain text format. Modify your program from the previous exercise to read the book you downloaded, skip over the header information at the beginning of the file, and process the rest of the words as before. Then modify the program to count the total number of words in the book, and the number of times each word is used. Print the number of different words used in the book. Compare different books by different authors, written in different eras. Which author uses the most extensive vocabulary?
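One possible way to skip the header is to read and discard lines until a marker line is reached. The sketch below assumes the marker starts with `*** START OF`, as in many current Project Gutenberg plain-text files; older files use different wording, so check the actual file first. The names `skip_header` and `word_histogram` and the in-memory stand-in file are hypothetical.

```python
import io
import string

# Assumption: the body begins after a line starting with this text.
START_MARKER = '*** START OF'

def skip_header(fin, marker=START_MARKER):
    """Consume lines up to and including the marker line.

    Returns True if the marker was found, False if the file ran out.
    """
    for line in fin:
        if line.startswith(marker):
            return True
    return False

def word_histogram(fin):
    """Map each cleaned, lowercased word to the number of times it occurs."""
    strippables = string.punctuation + string.whitespace
    hist = {}
    for line in fin:
        for word in line.replace('-', ' ').split():
            word = word.strip(strippables).lower()
            if word:  # ignore tokens that were nothing but punctuation
                hist[word] = hist.get(word, 0) + 1
    return hist

# A tiny stand-in for a downloaded book.
fake_book = io.StringIO(
    "Title: Example\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "The cat saw the dog.\n"
)
found = skip_header(fake_book)
hist = word_histogram(fake_book)
print(sum(hist.values()), 'words,', len(hist), 'different words')
```

If the marker is never found, `skip_header` returns False and the histogram comes out empty, which is a useful signal that the marker needs adjusting.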

Exercise 3: Modify the program from the previous exercise to print the 20 most frequently used words in the book.
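For the most-frequent-words step, the standard library's `collections.Counter` is one alternative to sorting a list of (count, word) pairs by hand. This is a sketch with a made-up word list, not the approach used in the programs below; `most_common_words` is a hypothetical helper name.

```python
from collections import Counter

def most_common_words(words, n=20):
    """Return the n most frequent words as (word, count) pairs."""
    return Counter(words).most_common(n)

words = "the cat and the dog and the bird".split()
top = most_common_words(words, 2)
for word, times in top:
    print(times, word, sep='\t')
```

One difference worth knowing: `most_common` keeps tied words in first-seen order, whereas sorting (count, word) tuples in reverse breaks ties by reverse alphabetical order.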

Exercise 4: Modify the previous program to read a word list (see Section 9.1) and then print all the words in the book that are not in the word list. How many of them are typos? How many of them are common words that should be in the word list, and how many of them are really obscure?
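Finding the words that are in the book but not in the word list is a set difference, so a sketch using Python sets is one natural fit. The tiny `book` and `word_list` values below are invented; `unknown_words` is a hypothetical helper name.

```python
def unknown_words(book_words, word_list):
    """Return the words that occur in the book but not in the word list."""
    return set(book_words) - set(word_list)

book = {'emma': 3, 'the': 10, "harriet's": 2}  # word -> count
word_list = ['the', 'a']

missing = unknown_words(book, word_list)
print(sorted(missing))
```

Iterating a dict yields its keys, so passing the histogram directly works. Possessives such as `harriet's` show up as unknown unless the cleaning step removes the apostrophe, which may explain part of a large "not in dict" total.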

import string
fin = open('words.txt')
mydict = {}
for line in fin:
    word = line.strip()
    mydict[word] = ''
file = open('emma.txt', encoding = 'utf-8')
essay = file.read().lower()
essay = essay.replace('-', ' ')
pun = {}
str_all = '“' + '”' + string.punctuation
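for x in str_all: # build a dict that maps every punctuation character to ''
    pun[x] = ''
useless = essay.maketrans(pun) # with two string arguments maketrans needs equal lengths; a dict sidesteps that
l = essay.translate(useless).split() # words containing '-' get butchered, but the pieces still count as words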
print('this book has', len(l), 'words')
book = {}
for item in l: # file -> string -> list of words -> dict of counts (word is the key, count is the value)
    book[item] = book.get(item, 0) + 1
list_words1 = sorted(list(zip(book.values(), book.keys())), reverse = True) # dict to list, with key and value swapped
print('this book has', len(list_words1), 'different words')
print('times', 'word', sep='\t')
count = 1
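word_len = 0 # minimum word length filter
for times, word in list_words1: # print the 20 most-used words longer than word_len (with no limit, trivial words of 3 letters or fewer flood the output)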
    if len(word) > word_len:
        print(times, word, sep='\t')
        count += 1
    if count > 20:
        break
count = 0
for word in book:
    if word not in mydict:
        # print(word, end=' ')
        count += 1
print(count, 'words in book not in dict') # the result is dismal: 590 in total
# this book has 164065 words
# this book has 7479 different words
# times   word
# 5379    the
# 5322    to
# 4965    and
# 4412    of
# 3191    i
# 3187    a
# 2544    it
# 2483    her
# 2401    was
# 2365    she
# 2246    in
# 2172    not
# 2069    you
# 1995    be
# 1815    that
# 1813    he
# 1626    had
# 1448    as
# 1446    but
# 1373    for
# 590 words in book not in dict
# -----------------------------Solution 2----------------------------- the only real difference is how the words are split
import string
def set_book(fin1):
    useless = string.punctuation + string.whitespace + '“' + '”'
    d = {}
    for line in fin1:
        line = line.replace('-', ' ')
        for word in line.split():
            word = word.strip(useless)
            word = word.lower()
            d[word] = d.get(word, 0) + 1
    return d
def set_dict(fin2):
    d = {}
    for line in fin2:
        word = line.strip()
        d[word] = d.get(word, 0) + 1
    return d
fin1 = open('emma.txt', encoding='utf-8')
fin2 = open('words.txt')
book = set_book(fin1)
mydict = set_dict(fin2)
l = sorted(list(zip(book.values(), book.keys())), reverse=True)
count = 0
for key in book:
    count = count + book[key]
print('this book has', count, 'words')
print('this book has', len(book), 'different words')
num = 20
print(num, 'most common words in this book')
print('times', 'word', sep='\t')
for times, word in l:
    print(times, word, sep='\t')
    num -= 1
    if num < 1:
        break
count = 0
for word in book:
    if word not in mydict:
        # print(word, end=' ')
        count += 1
# print()
print(count, 'words in book not in dict')
# this book has 164120 words
# this book has 7531 different words
# 20 most common words in this book
# times   word
# 5379    the
# 5322    to
# 4965    and
# 4412    of
# 3191    i
# 3187    a
# 2544    it
# 2483    her
# 2401    was
# 2364    she
# 2246    in
# 2172    not
# 2069    you
# 1995    be
# 1815    that
# 1813    he
# 1626    had
# 1448    as
# 1446    but
# 1373    for
# 683 words in book not in dict
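The two solutions above report slightly different totals. A minimal sketch of the two splitting strategies, run on a made-up sentence, shows two behaviours that could contribute to that gap. The function names and the sample are hypothetical, and the curly quotes handled in the real programs are left out for brevity.

```python
import string

def split_by_translate(text):
    """Solution 1 style: delete punctuation everywhere, then split."""
    table = str.maketrans('', '', string.punctuation)
    return text.lower().replace('-', ' ').translate(table).split()

def split_by_strip(text):
    """Solution 2 style: split first, then strip punctuation from the ends."""
    useless = string.punctuation + string.whitespace
    words = []
    for token in text.replace('-', ' ').split():
        words.append(token.strip(useless).lower())
    return words

sample = "Emma's well-known friend said ... don't!"
a = split_by_translate(sample)
b = split_by_strip(sample)
print(len(a), a)
print(len(b), b)
```

With `translate`, apostrophes inside words disappear (`emmas`, `dont`) and a token made only of punctuation vanishes. With `strip`, inner apostrophes survive (`emma's`, `don't`) and a punctuation-only token is kept as an empty string that still gets counted. Whether this accounts for the whole difference in emma.txt has not been checked here.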
2020-04-26

The exciting new Excel function FILTER

By xrspook @ 9:17:59 Filed under: 烂日记

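Around 2020-04-22 I heard that Office 365 was being renamed Microsoft 365, but the Office 365 on my Win 7 machine did not seem to change at all. Then a recent update arrived, and the awesome FILTER function is there! With FILTER, the complicated machinery of Advanced Filter no longer needs to exist. If all you want is to filter detail rows, FILTER is perfect. To me this feels like a database function, yet it is now available directly at the formula level. Awesome! FILTER can do what Advanced Filter does, but if what I want is a summary of the filtered data, I personally still prefer a PivotTable. A PivotTable can in fact filter detail rows too, but for that you first have to add a unique serial number to the detail data. I have known Advanced Filter ever since I first touched Excel, but apart from the times a teacher insisted on it, I never used it in practice. I do filter on multiple conditions, but only by stacking single-condition filters, which is the more common approach in real work. With FILTER, filtering can be played in new ways, which is really exciting.

Excitement aside, FILTER is a new Office 365 function. Whether Office 2019 has it is an open question, and Office 2016 certainly does not. I had worried for a good while about whether Office 365 on Win 7 would get it, because since Microsoft ended support for Win 7, Office 365 on Win 7 only receives security updates. If the system could not handle it, missing out on new features would be fair enough. But if the system is capable and Microsoft simply wants to force you to abandon Win 7 and your old computer and buy their new hardware and software, that is far too overbearing, isn't it! That FILTER works in Office 365 on Win 7 is, to some extent, luck, I think. I can use it, but if I send the file to people who do not have such a recent Office, they will still hit a wall. I used to think that, for me, the differences between Office versions were mostly cosmetic and the features stayed largely the same, which is why I stuck with Office 2003 right up until Office 365 (which corresponded to Office 2016 at the time). I felt no desire before because I did not know what the new versions offered. If I did not know what had been added, how could I be interested in it?!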

The more a person knows, the stronger the urge to know even more. I cannot say whether this innate urge counts as recursion.

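This April, three rounds of auditors have already come (of those three, only one has finished), and a fourth arrives tomorrow. I am not nervous. At most I take the paper documents out of their archived boxes and put them in order, and copy the electronic versions out of the archive and tidy them up. I did everything I was supposed to do long ago, so there is nothing to panic about. My colleagues do not think they have anything to panic about either, yet during the inspection their processes turned out to be full of holes. Why did they not do what they were supposed to do? How could things be handled this way without anyone knowing? Without anyone finding it wrong? Without anyone demanding that they fix it? Laziness has to have a limit. When it reaches the point where even the most basic logic cannot be guaranteed, it is inexcusable!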

I have been holding the line all along, but it turns out others have not...

© 2004 - 2024 我的天 | Theme by xrspook | Power by WordPress