DANIEL’S SEARCH HACK
by Richard White
On one of the last days of the AP Computer Science class, I met with a few students who had been participating in the High School Capture the Flag hacking contest. A high school in New Jersey had created a competition in which teams of high school students used digital tools to solve various types of puzzles.
One of the problems involved a text file with just two lines in it. The first line explained that students would need to search the file for a series of English words in sequence, although those words wouldn’t be separated by spaces. The second line consisted of over 10 million letters, mostly scrambled, but with words occasionally found within them.
Here are 200 letters from that file:
You can see the word “hiss” in there, along with “rips”, “fee”, “call”, “has”, and a number of 1- and 2-letter words, but clearly nothing identifiable as a sequential series of words.
So how do you go about looking for those needles in that haystack?
I had some ideas, and I’d been working on the problem for a day or two. I wrote a Python program to read all 10 million characters into a string so that I could search through it. I Googled and found a couple of lists of English words, arranged in order of popularity, and inserted those into my program as a list of words.
But now what? How do you start trying to find a sequence of words in a line of ten million letters?
My first algorithm looked like this:
- Get a word (call it word1) from the word list, and another word (word2) from the word list.
- Look in the text for an occurrence of the first word.
- If you find that word, look to see if the second word is near it. If it is, print it out. If not…
- Keep looking for additional occurrences of the first word and the second word, until you can’t find any more.
- Get a different word2 from the list and check that against word1.
- Repeat this until you’ve tried every word in the list as word2.
- Set word1 to the next word in the list and repeat the whole process.
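The steps above can be sketched roughly as follows. This is an illustrative reconstruction, not the original program: the function name `find_nearby_pairs` and the `max_gap` parameter are assumptions, since the list only says the second word should be "near" the first.

```python
def find_nearby_pairs(text, words, max_gap=10):
    """Brute-force search: for every ordered pair of words, scan the text
    for word1 and check whether word2 starts shortly after word1 ends.

    max_gap is an assumed definition of "near"; the article does not give one.
    """
    hits = []
    for word1 in words:
        for word2 in words:
            if word1 == word2:
                continue
            start = text.find(word1)
            while start != -1:
                end = start + len(word1)
                # word2 must begin at most max_gap characters after word1 ends.
                window = text[end:end + max_gap + len(word2)]
                if word2 in window:
                    hits.append((start, word1, word2))
                # Move on to the next occurrence of word1.
                start = text.find(word1, start + 1)
    return hits


# Tiny made-up example; the real file held over 10 million letters.
sample = "xxqzthecatvvvkkdogxx"
for hit in find_nearby_pairs(sample, ["the", "cat", "dog"]):
    print(hit)
```

With W words in the list, the nested loops make on the order of W squared passes over the text, which is one plausible reason a search like this crawls on a 10-million-character string.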
It didn’t work. It was a first attempt at trying to work through the file, and this particular strategy was much too slow—my search was going nowhere fast.
Time to alter the strategy.
My second try concentrated on reducing the number of text interactions that had to take place. This time, I would:
- Take the first word in the word list and identify all the locations (indexes) where that word existed in the string.
- Do this again with every word in the word list until I had a “dictionary” of word locations.
- Now start at the first word, and look at its first location. Go through the other words and their locations, and if a second word was found within 10 characters of this one…
- Look for a third word with a location within 10 characters of this one…
- … and then a fourth word. If I could find four words all within 30 characters or so, perhaps this would identify the flag phrase.
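A minimal sketch of this second strategy is below. It is not the original code: the names `build_index` and `find_chains` are hypothetical, and "within 10 characters" is interpreted here as "the next word starts no more than 10 characters after the previous word ends," which is only one possible reading.

```python
import bisect


def build_index(text, words):
    """Map each word to the list of positions where it starts in the text."""
    index = {}
    for word in words:
        positions = []
        start = text.find(word)
        while start != -1:
            positions.append(start)
            start = text.find(word, start + 1)
        if positions:
            index[word] = positions
    return index


def find_chains(index, length=4, max_gap=10):
    """Return chains of `length` words in which each word starts within
    max_gap characters after the previous word ends (an assumed rule)."""
    # Flatten the index into (position, word) pairs sorted by position.
    occurrences = sorted(
        (pos, word) for word, places in index.items() for pos in places
    )
    starts = [pos for pos, _ in occurrences]
    chains = []

    def extend(chain):
        if len(chain) == length:
            chains.append(chain)
            return
        last_pos, last_word = chain[-1]
        end = last_pos + len(last_word)
        # Binary search for occurrences starting in [end, end + max_gap].
        lo = bisect.bisect_left(starts, end)
        hi = bisect.bisect_right(starts, end + max_gap)
        for occurrence in occurrences[lo:hi]:
            extend(chain + [occurrence])

    for occurrence in occurrences:
        extend([occurrence])
    return chains


# Tiny made-up example.
sample = "qqthisxisyaztestqq"
index = build_index(sample, ["this", "is", "a", "test"])
for chain in find_chains(index):
    print(" ".join(word for _, word in chain))
```

Even in this toy input, the "is" hiding inside "this" produces an extra chain alongside the intended one. Short words that turn up by chance in scrambled text are a likely source of noisy output with this kind of approach.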
Here’s what the results look like for that strategy:
It didn’t work. I mean, the program worked fine, producing a seemingly endless string of lines that matched the specifications I had, but none of the lines produced were what I was looking for. Here’s the partial output from that second algorithm.
I still was getting “words” in each of those lines that didn’t correspond to the flag I was searching for. This was starting to get frustrating.
I went into class the next day and explained to the students my frustrations, and they were happy to brainstorm different strategies. At one point, Daniel said, “Why don’t you just look through each line in the file and count the number of words in it? Maybe the line with the most words will be the one we’re looking for.”
“That’s a good idea, Daniel, but the file isn’t split up into lines. It’s just one long line of characters.”
He thought about it for a minute, and his partner Ezra said, “Well, let’s just split it up into lines. 140 character lines. That’s good enough for Twitter.”
So that’s what we did. The students have been studying Java for this past year and I was coding in Python, so they watched while I coded their strategy:
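    for line in lines:
        num_of_words = 0
        # Count how many words from the word list appear in this line
        for word in searchTerms:
            if line.find(word) != -1:
                num_of_words += 1
        # Only report lines that contain an unusually large number of words
        if num_of_words > 25:
            print(num_of_words, line)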
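The loop above takes `lines` and `searchTerms` as given. A self-contained sketch of the whole approach, including the split into 140-character lines, might look like the following. The helper names (`split_into_lines`, `count_words`, `best_lines`) and the toy data are assumptions made for illustration only.

```python
def split_into_lines(text, width=140):
    """Cut one long string into consecutive chunks of `width` characters."""
    return [text[i:i + width] for i in range(0, len(text), width)]


def count_words(line, search_terms):
    """Count how many distinct search terms appear somewhere in the line."""
    return sum(1 for word in search_terms if word in line)


def best_lines(text, search_terms, threshold=25, width=140):
    """Return (count, line) pairs for chunks containing more than
    `threshold` of the search terms."""
    results = []
    for line in split_into_lines(text, width):
        count = count_words(line, search_terms)
        if count > threshold:
            results.append((count, line))
    return results


# Tiny made-up example: 24-character chunks and a low threshold.
noise = "qzxvkj" * 4
sample = noise + "theseletterscontainwords" + noise
terms = ["these", "letters", "contain", "words"]
for count, line in best_lines(sample, terms, threshold=2, width=24):
    print(count, line)
```

One caveat of fixed-width chunks: a phrase that happens to straddle a chunk boundary is split across two lines and each half scores lower. Overlapping the chunks, or rerunning with a shifted starting offset, would be one way to guard against that.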
Because it was a relatively simple strategy, the two decided that they could afford to search for any words contained in a 10,000-word dictionary. We watched as the results appeared on the screen:
Last login: Sat May 24 18:03:42 on ttys002
MotteRouge:~ rwhite$ /var/folders/6x/vklj_pls5215szrp_k2qcjcc0000gn/T/Cleanup\ At\ Startup/find3-422672661.132.py.command ; exit;
File has been split into lines! Proceeding…
And there it is, on the line with 33 matches, just a few moments after starting the program running:
these letters represent the cloth marker that unlocks rewards
We submitted the flag to the contest and saw our score total immediately jump up 400 points. It was our single biggest success of the competition to that point.
I like this story because it reminds me of a few things that we sometimes forget. Working on digital problems like this is not always about coding; sometimes it’s about strategies. Also, there are trade-offs in strategies that require considering constraints: how much memory, power, time, and data do you have? How do your solution results vary based on the trade-offs you make?
Computer Science teachers talk about these things, but I’m always pleased to see a situation where the students get to actually experience that exploratory process themselves.
Thanks to Daniel, Ezra, Adam, and Stephanie for being willing to play with this problem on the last day of school!