The one feature I was going to miss the most when moving to Jekyll was full-text search for my site. But with a little study, I was able to get a rudimentary search going with jQuery. The problem is not so much building a full-text index for jQuery to query; the problem is building one that doesn't force the client to download the site's entire content just to answer a query.
All I wanted was a giant hash table so that I could look up a word and know which posts contained it. It's not rocket science. Hash tables are easy to represent as JSON, so why not have a series of pages, each a JSON hash telling you which words point to which posts? The main drawback is that such a hash table is bound to be enormous, and it's not polite to make the client download a huge file and keep it in memory. Most users, when searching, only look for a few words at a time. What if you could break the full-text index into separate files and only load the part of the index that you need at the time? Could you make a responsive full-text index? Yes, you could.
This is how it works...
When the site is being compiled, Jekyll takes the text of every post and creates a giant hash mapping each word to the posts that contain it. Most words map to multiple posts.
Once all the posts have been indexed, Jekyll creates a series of JSON files containing the words and their associated posts. It splits the files based on the first two letters of each word, so "smoke" ends up in "sm.json" and "wreck" ends up in "wr.json".
When a user executes a search on the site, jQuery takes the first two letters of whatever the user typed and fetches the corresponding JSON file. It looks up the word it's searching for and identifies which posts it needs to show to the user.
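To make that concrete, here's what one of the generated index files might look like. Each key is a (stemmed) word and each value is a list of post ids; the particular words and ids here are made up for illustration:

```json
{
  "smoke": [4, 17],
  "small": [4],
  "smart": [17]
}
```

With this layout, a search for "smoke" only has to download sm.json, not the whole index.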
I then created a special layout for the search result objects, so jQuery can fetch the HTML for the associated posts and display it to the end user. It turns out this happens relatively quickly. It's faster than a lot of client-side searches I've used (I'm looking at you, Drupal).
Here's what you need to do to get this working. First of all, put this in your _config.yml:

searchindex: yes
Next, we'll create the layout for the search result snippet. Create a file called search_post.html in your _layouts directory with this in it:
<div class="search-entry">
  <div class="title"><a href="{{ page.link }}" title="{{ page.title }}">{{ page.title }}</a></div>
  <div class="description">{{ page.description }}</div>
</div>
Note that my Jekyll site uses a description attribute in the YAML front matter that I don't think is standard in Jekyll. It comes in handy for times like this, and also for creating meta tags and Open Graph tags.
Next, create a file in your _plugins directory called generate_searchindex.rb. Wrap the whole thing in a module Jekyll statement. Then add a new Page subclass; this will represent the series of JSON files we'll be creating:
class SearchIndex < Page
  def initialize(site, base, dir, letters)
    @site = site
    @base = base
    @dir = dir
    @name = "#{letters}.json"
    self.process(@name)
  end
end
Next, create a page object for the search result pages:
class SearchPost < Page
  def initialize(site, base, dir, pid, post)
    @site = site
    @base = base
    @dir = dir
    @name = "#{pid}.html"
    self.process(@name)
    # Read the YAML data from the layout page.
    self.read_yaml(File.join(base, '_layouts'), 'search_post.html')
    # Copy the title, link, and description from the post.
    self.data['title'] = post.data['title']
    self.data['link'] = post.url
    self.data['description'] = post.data['description']
  end
end
Now, we need to do the actual processing. This plugin is essentially a generator, so let's create a new generator object:
class SearchGenerator < Generator
  safe false
  priority :low

  def generate(site)
    site.write_search_files if site.config['searchindex']
  end
end
You'll notice that I marked this generator as not safe. This plugin won't run on GitHub because it depends on a stemming gem, and for stemming to work you have to run the build on your own machine and upload the generated files to GitHub separately.
NOTE ON STEMMING: I use stemming so that a search for "running" also matches "runs" and "run". I decided it was a must-have for even a basic search engine. I was originally using ruby-stemmer from https://github.com/aurelian/ruby-stemmer, but I have since switched to fast_stemmer, which is much faster (hence the name).
Here's the important code:
require 'json'
require 'fast_stemmer' # provides the String#stem used below

class Site
  attr_accessor :search_index, :search_posts

  def write_search_files
    createindex!
    dir = self.config['search_dir'] || 'search'
    self.search_index.keys.each do |letter|
      write_search_index(self, File.join(dir, 'terms'), letter, self.search_index[letter])
    end
    self.search_posts.keys.each do |i|
      write_search_post(self, File.join(dir, 'posts'), i, self.search_posts[i])
    end
  end

  def write_search_index(site, dir, letter, data)
    index = SearchIndex.new(site, site.source, dir, letter)
    index.output = data.to_json
    index.write(site.dest)
    self.static_files << index
  end

  def write_search_post(site, dir, pid, post)
    index = SearchPost.new(site, site.source, dir, pid, post)
    index.render(site.layouts, site_payload)
    index.write(site.dest)
    # Record the fact that this page has been added, otherwise Site::cleanup will remove it.
    self.static_files << index
  end

  def createindex!
    searchwords = Hash.new
    postlist = Hash.new
    self.posts.each_index do |i|
      rawtext = self.posts[i].to_s.downcase
      if self.posts[i].data['title']
        rawtext << ' ' + self.posts[i].data['title'].downcase
      end
      if self.posts[i].data['description']
        rawtext << ' ' + self.posts[i].data['description'].downcase
      end
      rawtext.scan(/[a-zA-Z0-9]{1,}/).each do |word|
        postlist[i] = self.posts[i]
        letter = word.stem[0, 2]
        # does the two-letter bucket exist? if not, add it
        if !searchwords.key?(letter)
          searchwords[letter] = Hash.new
        end
        # does the full stemmed word exist? if not, add it
        if !searchwords[letter].key?(word.stem)
          searchwords[letter][word.stem] = Array.new
        end
        # add the post key to the hash
        searchwords[letter][word.stem].push(i)
      end
    end
    self.search_index = searchwords
    self.search_posts = postlist
  end
end
This code creates two big hash variables in the createindex! method. One stores the association of words to post ids; the code turns it into a series of JSON files and puts them in search/terms.
For instance, in the al.json file, you would see the word "alone" with several ids after it. Then you would see the word "align" with different ids. Each id is a post that contains that word, and each id relates specifically to the other big hash variable that was created.
That other hash variable stores a relationship of post ids to their posts. For instance, 23.html is the snippet of HTML that shows a search result entry for the post that we've called "23". When a search is performed, the front end gets a list of all the ids that contain the search words, then retrieves the search result item for each of those ids. Those snippets are put in search/posts.
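So a file like search/posts/23.html ends up holding nothing but the rendered search_post.html layout for one post, something like this (the URL, title, and description here are hypothetical):

```html
<div class="search-entry">
  <div class="title"><a href="/2012/01/some-post/" title="Some Post">Some Post</a></div>
  <div class="description">A short teaser pulled from the post's front matter.</div>
</div>
```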
Whew.
Hopefully the back-end is working now. It's time to get this working on the front end.
Since I'm stemming on the back end, I need to stem on the front end too. I used the Porter stemming algorithm, via the JavaScript implementation provided by Martin Porter. Download the file and put it on your server.
Next, create a file called site-search.js on your server. Here's the first bit of code for it:
var searchTimer;
$(document).ready(function () {
  $('#search-bar .page-bounds').prepend('<div id="search-results"></div>');
  $('#search-text').keydown(function () {
    if (searchTimer == null) {
      searchTimer = setTimeout(function () {
        siteSearch($('#search-text').val());
      }, 500);
    }
  });
});
This code sets a global JavaScript variable for the timer, creates an element to hold the search results, and then listens for changes to the search text box.
Here's the code that does the actual searching:
var siteSearch = function (w) {
  var o = this;

  o.parseWords = function (w) {
    // parse the words out of the query
    var words = w.toLowerCase().match(/\w{2,}/gi);
    // convert the array to stemmed words, skipping duplicates
    var sWords = [];
    for (var w2 in words) {
      var stem = stemmer(words[w2]);
      if ($.inArray(stem, sWords) == -1) {
        sWords.push(stem);
      }
    }
    // return the stemmed version
    return sWords;
  };

  o.getIndexUrls = function (ws) {
    // create a set of urls pointing to the first two letters of each word
    var files = {};
    for (var word in ws) {
      var url = '/search/terms/' + ws[word].substring(0, 2).toLowerCase() + '.json';
      files[url] = null;
    }
    return files;
  };

  o.loadIndexes = function (is) {
    // make an ajax call to get all the indexes
    for (var file in is) {
      $.getJSON(file, o.getPostIds);
    }
  };

  o.getPostIds = function (ts) {
    if (!o.posts.length) {
      o.posts = [];
    }
    // loop through the terms, then the ids for each term
    for (var term in ts) {
      // if the index term matches one of our search terms, add it to the list of posts
      if ($.inArray(term, o.words) != -1) {
        for (var id in ts[term]) {
          if (!o.posts[ts[term][id]]) {
            o.posts[ts[term][id]] = 1;
          } else {
            o.posts[ts[term][id]]++;
          }
        }
      }
    }
  };

  o.getPosts = function () {
    // sort the post ids by how many times they matched, then fetch the top 20
    var sortable = [];
    for (var postId in o.posts) {
      sortable.push([postId, o.posts[postId]]);
    }
    o.posts = sortable.sort(function (a, b) { return b[1] - a[1]; });
    for (var i = 0; i < o.posts.length && i < 20; i++) {
      $.get('/search/posts/' + o.posts[i][0] + '.html', o.loadPostData);
    }
    $(document).unbind();
  };

  o.loadPostData = function (ts) {
    $('#search-results').append(ts);
  };

  o.clearResults = function () {
    $(document).unbind();
    $('#search-results').empty();
    $('#search-results').hide();
    $('.form-search i').remove();
    $('.form-search input').val('');
  };

  // here's the main code of the function
  clearTimeout(searchTimer);
  searchTimer = null;
  $(document).unbind();
  $('#search-results').empty();
  $('#search-results').hide();
  o.posts = [];
  o.words = o.parseWords(w);
  o.indexUrls = o.getIndexUrls(o.words);
  o.loadIndexes(o.indexUrls);
  $(document).ajaxStop(function () {
    if (o.posts.length) {
      o.getPosts();
      $('#search-results').css('top', $('.form-search').offset().top + 34);
      $('#search-results').show();
      $('<i class="icon-remove"></i>').appendTo(".form-search").click(o.clearResults);
    }
  });
};
The only thing complicated about this code is the AJAX-iness of it. Essentially, it takes what's currently in the #search-text box, converts it to lower case, stems it, and then loads the JSON files based on the first two letters of each search word. Within that JSON file, it finds the specific word it's looking for and creates an array of post ids. If there are multiple words, it gets multiple lists of ids and merges them into one array.
Once it has a list of all the ids, it has to get a search result item for each of those ids, so it makes another JSON call - this time to get the search result snippet.
If you look at the code on GitHub, you'll see that the jQuery code is a little fancier than what I've described here. It interprets key presses so the user can highlight an item with the arrow keys and go to that page with the return key.
I'd like to add pagination at some point, but that's a ways down the line.
You can see the whole thing working at marran.com, and all the source code is at https://github.com/captaincanine/marran.com.
No servers working overtime to make it happen!