A tedious part of reading a book in a new language is jotting down all the new words you encounter. My usual method is to read a couple of pages, writing down the words I don’t know and trying to get the gist of what’s happening. Then I look up the words I’ve written down, learn them, and reread those pages. This isn’t a very efficient way to do things, and it undermines the experience of reading a book for the first time. Ideally, you’d know in advance which new words you’ll have to learn before you start reading. The problem is that unless you fork out for special learners’ editions, you can’t really do this. This is silly, since most of the books I want to read are classics and available on http://www.gutenberg.org/ anyway.
To help with this, I’ve written a short and simple Python program which lets you scan the words in a .txt file, separate them into words you know and words you don’t, and then convert the list of words you don’t know into a .csv file which can be imported into an Anki deck. The full code is pasted at the bottom of this post (you’ll need to import collections, os, itertools, csv and io).
Here’s an explanation.
The program gives you the option of sorting the words either by frequency (most common words first) or by order of appearance in the text. Both are reasonable approaches, and your choice will probably depend on how much time you have. The first two functions strip punctuation and capitalisation from the text and output the words sorted accordingly.
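In miniature, the two orderings look like this (a standalone sketch using a toy sentence of my own, not the program itself):

```python
import collections

text = "le chat noir et le chien noir"
words = text.split()

# Frequency order: most common words first (ties keep their original order,
# since sorted() is stable)
count = collections.Counter(words)
by_frequency = sorted(count, key=count.get, reverse=True)

# Appearance order: each word once, in the order it first occurs
by_appearance = list(dict.fromkeys(words))

print(by_frequency)   # 'le' and 'noir' appear twice, the rest once
print(by_appearance)
```

Frequency order front-loads the words that pay off most per minute of study; appearance order lets you start reading sooner.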
The program then presents each word to you individually, and you respond by pressing ‘y’ or ‘n’ depending on whether or not you already know the word. If you get bored or need to stop early, you can type ‘quit’. You’ll then have the option of putting the words you don’t know into a Python dictionary and/or a .csv file. The file will be output to your desktop.
Once you have the csv file, you can import it to an Anki deck and fill in the meanings of the words you don’t know.
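If you’d rather have a placeholder back on each card ready to fill in, the CSV can be written with two columns instead of one. A sketch (the word list and the ‘Translation’ placeholder are my own; I write to an in-memory buffer here rather than a file so you can see the output):

```python
import csv
import io

# One row per unknown word; Anki imports the two columns as front and back.
unknown = ['sifflement', 'ébloui']
buffer = io.StringIO()
writer = csv.writer(buffer)
for word in unknown:
    writer.writerow([word, 'Translation'])

print(buffer.getvalue())
```

Writing to a real file works the same way, provided you open it with newline='' and encoding='utf-8' so the accents survive.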
Here’s an example of how it can be used.
Go to Project Gutenberg and copy and paste ‘Du côté de chez Swann’ into a .txt file on your desktop. Run the program and sort the obvious words (‘longtemps’, ‘de’, ‘yeux’) from the words you might not know (‘sifflement’). Make an Anki deck of the words you don’t know and look them up in the dictionary (‘sifflement’ means whistling). You can then learn the words you’ll need to read a chapter without having to read the whole chapter. (I might write another post on Zipfian distributions, which could help us determine exactly how much more efficient this method is than the reading and note-taking method.)
I’ve pasted the code below. I can’t claim that it’s particularly elegant. Feel free to use it or improve it. Most of the trouble I had when writing it concerned preserving accents in UTF-8. The code represents a personal milestone in that it is the first labour-saving program I have written which was not more labour-intensive to produce than the sum of labour saved.
Update: Friday 3rd March
Here’s the obvious extension of the program. It automatically translates the words you don’t know into English and stores them in your dictionary. You can then use these translations to build your Anki deck. It uses goslate, a free Google Translate API. Google have recently updated Google Translate in a way that prevents this from working; however, if you switch your VPN around when you start getting HTTP ERROR 503, it should still work. You can download the source code for goslate here: https://pythonhosted.org/goslate/ . If you are using the program to learn a language, I’d recommend using a proper dictionary rather than Google Translate.
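The shape of that step is just a dictionary comprehension mapping each unknown word to a translation. Here’s a sketch that runs offline: fake_translate is a hypothetical stand-in I’ve written for illustration; with goslate installed you would call goslate.Goslate().translate(word, 'en') in its place.

```python
# fake_translate is a placeholder for goslate.Goslate().translate, so this
# sketch runs without a network connection.
def fake_translate(word, target='en'):
    lookup = {'sifflement': 'whistling'}  # toy dictionary for illustration
    return lookup.get(word, word)  # fall back to the word itself

nlist = ['sifflement']  # the words you answered 'n' to
dictionary = {word: fake_translate(word) for word in nlist}
print(dictionary)
```

The resulting dictionary can then be written out as the two-column CSV for Anki instead of the ‘Translation’ placeholder.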
import collections
import csv
import io
import itertools
import os

x = open(os.path.expanduser("~/Desktop/Text.txt"), 'rt', encoding='utf8')

def strippunc(text):
    # replace() with the whole punctuation string at once would only remove
    # that exact sequence, so strip each character individually
    for p in ',<>?&^%$#@!.:;"\'?/()':
        text = text.replace(p, '')
    return text.lower()

def prepfreq(l):
    # sort words by frequency, most common first
    with l as x:
        count = collections.Counter(strippunc(x.read()).split())
        return sorted(count, key=count.get, reverse=True)

def prepapp(l):
    # keep each word once, in order of first appearance
    with l as x:
        words = []
        for word in strippunc(x.read()).split():
            if word not in words:
                words.append(word)
        return words

def call(x):
    answer = input("Would you like to sort the words by frequency or appearance?: ")
    if answer == 'frequency':
        words = prepfreq(x)
    else:
        words = prepapp(x)
    klist = []
    nlist = []
    for word in words:
        inp = input(word + ': ')
        if inp == 'y':
            klist.append(word)
        elif inp == 'n':
            nlist.append(word)
        elif inp == 'quit':
            break
    print('All sorted. You know:')
    print(klist)
    print("You don't know:")
    print(nlist)
    dicto = input("Would you like this as a dictionary? ")
    if dicto == 'yes':
        dictionary = dict(zip(nlist, ['Translation'] * len(nlist)))
        print(dictionary)
    # named wantcsv so it doesn't shadow the csv module used in convert()
    wantcsv = input("Would you like this as a csv file? ")
    if wantcsv == 'yes':
        return convert(nlist, os.path.expanduser("~/Desktop/words.csv"))

def convert(data, path):
    with open(path, 'w', newline='', encoding='utf-8') as output:
        writer = csv.writer(output)
        for word in data:
            writer.writerow([word])

call(x)
For translations switch the bottom 13 lines to: