Note: this post is now on blog.searchmysite.net at https://blog.searchmysite.net/posts/searchmysite.net-building-a-simple-search-for-non-commercial-websites/.
Introduction
I’ve written previously about what went wrong with the internet and how to fix it, and one of the ideas I mentioned was a new model for search. Given “talk is cheap, show me the code”, I decided to implement it. Okay, it wasn’t quite that easy, but here it is: https://searchmysite.net.
The key features are that it:
- Contains only sites submitted by verified site owners, as a form of quality control.
- Contains no adverts, and downranks results containing adverts to discourage “Search Engine Optimisation”, clickbait content etc. (note that there is a model for sustaining the service long term without having to resort to advertising).
- Features a very high degree of privacy (no persistent cookies, only one session cookie in the Add My Site and Manage My Site sections, no code downloaded from third parties, etc.).
- Has an API for site owners to e.g. inspect their data and add a search box to their own sites.
- Has filters for site owners to customise their indexing process.
To quickly recap the idea, it has been inspired by the growing interest in the noncommercial web and the reaction against the over-commercialisation of the internet and the problems that brings. On forums like Hacker News, for example, there have been a lot of comments about how hard it is to find all the fun and interesting content from personal websites and blogs nowadays and how the advertising funded search model is broken [1].
There’s quite a bit more I could write about the origin and evolution of the idea, why certain design decisions were made, and so on, but for the rest of this post I’ll just stick to the technical details.
In summary, there are 4 main components:
- Search: Apache Solr search server.
- Indexer: Scrapy indexing scripts.
- Database: Postgres database (for site and index management).
- Web: Apache httpd with mod_wsgi web server, with static assets (including home page), and dynamic pages (including API).
Search
Search engine
Given that the solution includes a search as a service, I wanted to set up my own search infrastructure rather than build on top of another existing search as a service, or buy results from an existing search engine.
I chose Apache Solr, rather than the more trendy Elasticsearch, because I know it better, and it is proper open source. Both Elasticsearch and Solr are based on Apache Lucene anyway.
In terms of Solr setup, I’ve tried to keep it as simple as possible to start with. In particular, I’ve not set up SolrCloud or sharding, because this additional complexity should only be used if there is a genuine need for it. If I need to scale it, the first step would simply be more memory and CPUs. The next step would be to move to SolrCloud with 3 nodes and an external Zookeeper. Longer term I could even get the content much closer to users by separating read and write nodes and having the read nodes globally distributed via the Cross Data Centre Replication feature. So there should be plenty of opportunities for scaling.
Relevancy tuning
I’ve just performed some really basic relevancy tuning for the time being:
    <str name="qf">title^1.5 tags^1.2 description^1.2 url^1.2 author^1.1 body</str>
    <str name="pf">title^1.5 tags^1.2 description^1.2 url^1.2 author^1.1 body</str>
    <str name="bq">is_home:true^2.5 contains_adverts:false^15</str>
Key points are that home pages get a good boost, and pages without adverts get a massive boost. Some basic testing suggests that pages with adverts tend to come towards the end of the results, which is an interesting contrast to the commercial search engines which are often gamed to put the pages with the most adverts nearer the top. There are times when pages with adverts come higher up of course, e.g. if they are a home page with the search term in the title, but that’s probably fine. Note that the use of qf (Query Fields), pf (Phrase Fields) and bq (Boost Query) parameters means I can’t use the standard query parser, so I’ve switched to eDisMax.
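As a rough illustration of how these parameters fit together, the sketch below builds the equivalent eDisMax request in Python, sending the values per request rather than as handler defaults. The boosts are copied from the configuration above; the function name, the host and the core name `content` are assumptions made for the example, not the actual searchmysite.net code.

```python
from urllib.parse import urlencode

# Boosts copied from the configuration above.
QUERY_FIELDS = "title^1.5 tags^1.2 description^1.2 url^1.2 author^1.1 body"
BOOST_QUERY = "is_home:true^2.5 contains_adverts:false^15"


def build_select_params(user_query: str) -> dict:
    """Return Solr request parameters for an eDisMax search (hypothetical helper)."""
    return {
        "defType": "edismax",  # qf, pf and bq need the (e)DisMax parser
        "q": user_query,
        "qf": QUERY_FIELDS,    # fields searched, with per-field weights
        "pf": QUERY_FIELDS,    # extra boost when the terms match as a phrase
        "bq": BOOST_QUERY,     # boost for home pages and advert-free pages
    }


# Hypothetical host and core name, for illustration only.
SOLR_SELECT_URL = "http://localhost:8983/solr/content/select"

params = build_select_params("static site generators")
print(SOLR_SELECT_URL + "?" + urlencode(params))
```

Keeping the boosts in one place like this makes it easier to experiment with different weights when the relevancy tuning is revisited.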
As the system gets more content, I’ll definitely need to revisit the relevancy tuning. Search is actually a very hard problem, even though people often think it is easy, and relevancy tuning in particular is one of those topics that is both difficult to do well and vastly underappreciated. Over the years I’ve heard many people suggest, thinking they’re being helpful or coming up with a great idea, that a search should be “more like major-internet-search-X”, without realising the enormous sums of money major-internet-search-X spends on R&D, both developing and deploying some of the most state-of-the-art AI/ML, and employing a vast army of people making manual tweaks and performing quality analyses by hand.
Anyway, one of the first changes I’d like to take a look at once there is more content is a form of PageRank. I’m already gathering what I call the indexed outlinks, i.e. links to pages on other sites which have been included in the search, and I have a plan to generate (at index time) what I call the indexed inlinks, i.e. the links to that page from other pages that have been included in the search. The scoring could be implemented by a boost function which counts the number of “indexed inlinks”. That’s not exactly the same thing as the full PageRank, but I’m not planning on indexing the entire internet any time soon.
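The inlink idea can be sketched as a simple inversion of the outlink data. This is only an illustration under assumptions: the function name and the example URLs are hypothetical, and a real implementation would run at index time against the search index rather than an in-memory dictionary.

```python
from collections import defaultdict


def indexed_inlinks(outlinks_by_page: dict) -> dict:
    """Invert page -> indexed outlinks into page -> indexed inlinks."""
    inlinks = defaultdict(set)
    for source, targets in outlinks_by_page.items():
        for target in targets:
            if target != source:  # ignore a page linking to itself
                inlinks[target].add(source)
    return {page: sorted(sources) for page, sources in inlinks.items()}


# Hypothetical pages, all assumed to be in the index already.
outlinks = {
    "https://a.example/": ["https://b.example/post", "https://c.example/"],
    "https://b.example/post": ["https://c.example/"],
    "https://c.example/": [],
}

inlinks = indexed_inlinks(outlinks)
# The count per page is the value a boost function could use.
inlink_counts = {page: len(sources) for page, sources in inlinks.items()}
print(inlink_counts)
```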
Another early change could include some kind of scoring for recency. I haven’t done so yet because I’m not sure how reliable the date related data is. For example, many pages don’t have a published date, and many static site generators change the last modified date for every page every time the site is regenerated irrespective of whether the content on that page or the template for that page has changed.
Much further down the road there are plenty of other potential improvements I could look at, e.g. semantic search, natural language queries, and so on.
Indexer
Indexing
I first looked at Apache Nutch for indexing, and then Stormcrawler. Nutch was more orientated towards batch processing, and Stormcrawler stream processing, with both geared towards indexing potentially large numbers of sites concurrently. However, with the search as a service, I felt I was going to need more granular control of individual site indexing than either of those seemed to offer. This is to be able to, for example, show the site owner roughly when their next index time is due, and perhaps even allow them to trigger an on-demand index.
I settled on Scrapy. It has some good documentation, and it is a popular tool so there’s plenty of information about it available. I was able to get some basic indexing of a site into Solr with 3 lines of custom code, which is the sort of thing I like:
    def open_spider(self, spider):
        self.solr = pysolr.Solr(self.solr_url)  # always_commit=False by default

    def process_item(self, item, spider):
        self.solr.add(dict(item))  # doesn't commit, and would be slow if it did

    def close_spider(self, spider):
        self.solr.commit()
That said, I do have more custom code now for things like custom deduplication and dealing with deleted or moved documents.
One important customisation worth mentioning is a cap on the number of documents per site. This is important to keep the index size and the indexing time manageable in the short term. Unfortunately the standard approach of using CLOSESPIDER_ITEMCOUNT only works at the class level, and I needed it at the instance level (so e.g. different sites could have different values), so I have a counter which raises a CloseSpider when a site’s limit is reached.
Another point worth mentioning is that I had hoped to avoid storing the body, to keep the index size more manageable, but I needed to store it for the results highlighting to work.
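A per-site counter of this kind might look roughly like the sketch below. The `PageLimit` class and its method names are hypothetical, not the actual implementation; `CloseSpider` is the real Scrapy exception, with a small stand-in defined so the example still runs where Scrapy is not installed.

```python
try:
    from scrapy.exceptions import CloseSpider
except ImportError:
    # Stand-in so the sketch runs without Scrapy installed.
    class CloseSpider(Exception):
        def __init__(self, reason="cancelled"):
            super().__init__()
            self.reason = reason


class PageLimit:
    """Per-instance cap on indexed pages (hypothetical helper)."""

    def __init__(self, max_pages: int):
        self.max_pages = max_pages
        self.pages_seen = 0

    def count_page(self) -> None:
        """Call once per page from a spider callback; stops the crawl at the cap."""
        if self.pages_seen >= self.max_pages:
            raise CloseSpider(reason="site page limit reached")
        self.pages_seen += 1


# Two sites with different limits, each tracked by its own counter.
small_site = PageLimit(max_pages=2)
large_site = PageLimit(max_pages=50)

small_site.count_page()
small_site.count_page()
try:
    small_site.count_page()  # third page exceeds the limit of 2
except CloseSpider as exc:
    print("closing spider:", exc.reason)
```

In a real spider the counter would be created per site in the spider's constructor and `count_page()` called at the top of the parse callback.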
Scheduling
Although I’ve tried to reuse existing solutions wherever possible, I couldn’t really find anything that fit my requirements when it came to scheduling. Firstly, I wanted to be able to report on indexing status for each individual domain, so the site owners could see when the last index was completed and when the next index was due, or if the indexing was currently running. But secondly, I didn’t want to schedule each site separately, given that if this is successful there could be 1000s of sites, and I didn’t want to manage 1000s of scheduled jobs. So I wrote a custom solution.
This uses a database to keep track of status for each site and a continuously running script that reads the database for sites to reindex. If there’s a lot happening at once, multiple sites will be indexed on one server asynchronously within the Twisted reactor, and multiple indexing servers can run concurrently (longer term these could even be geographically distributed to be closer to the sites they index).
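The selection step of such a scheduler can be sketched as follows. This is a minimal illustration, assuming a single reindexing interval and an in-memory list standing in for the database table; the field and function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class SiteStatus:
    domain: str
    last_index_completed: Optional[datetime]  # None if never indexed
    currently_indexing: bool = False


def next_index_due(site: SiteStatus, interval: timedelta) -> datetime:
    """When the next index is due; a never-indexed site is due immediately."""
    if site.last_index_completed is None:
        return datetime.min
    return site.last_index_completed + interval


def sites_to_reindex(sites: List[SiteStatus], now: datetime,
                     interval: timedelta) -> List[SiteStatus]:
    """Sites that are due and not already being indexed, most overdue first."""
    due = [s for s in sites
           if not s.currently_indexing and next_index_due(s, interval) <= now]
    return sorted(due, key=lambda s: next_index_due(s, interval))


now = datetime(2020, 7, 20, 12, 0)
sites = [
    SiteStatus("fresh.example", datetime(2020, 7, 19, 12, 0)),
    SiteStatus("stale.example", datetime(2020, 7, 1, 12, 0)),
    SiteStatus("new.example", None),
    SiteStatus("busy.example", datetime(2020, 7, 1, 12, 0), currently_indexing=True),
]
for site in sites_to_reindex(sites, now, interval=timedelta(days=7)):
    print(site.domain, "due since", next_index_due(site, timedelta(days=7)))
```

The same due time is what a site owner could be shown as their next index time, and a continuously running loop would call the selection function repeatedly and hand the results to the indexer.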
Database
The database is Postgres. It is primarily used for recording details of the submitted sites, i.e. home page, whether it has been verified, date of verification, etc.
It was tempting to keep things simple and store everything in Solr, but it is not good practice to master important data in a search engine, and it is easier to set up backups and so on from a database.
The database is also used for keeping track of the indexing status as per above.
Web
API layer
For the API, the simplest option would have been to use the out-of-the-box Solr http API. However, I’ve built my own API layer between the frontend and Solr. Having an abstraction layer like this is good practice: (i) to simplify the API as much as possible, (ii) protect users from underlying technology changes, and perhaps most importantly (iii) block all access to the Solr URLs for security.
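To illustrate points (i) and (ii), the sketch below reduces a Solr select response to a smaller, technology-neutral shape. The input follows Solr's standard JSON response layout (`response.numFound`, `response.docs`, `highlighting`), but the function name, the output keys and the assumption that documents are keyed by an `id` field are all hypothetical, not the actual searchmysite.net API.

```python
def simplify_solr_response(solr_json: dict) -> dict:
    """Reduce a Solr select response to a small, technology-neutral shape."""
    response = solr_json.get("response", {})
    highlighting = solr_json.get("highlighting", {})
    results = []
    for doc in response.get("docs", []):
        # Highlight snippets are keyed by document id, then by field name.
        snippets = highlighting.get(doc.get("id"), {}).get("body", [])
        results.append({
            "url": doc.get("url"),
            "title": doc.get("title"),
            "snippet": snippets[0] if snippets else None,
        })
    return {"total": response.get("numFound", 0), "results": results}


# Hypothetical Solr response, trimmed to the parts used above.
example = {
    "responseHeader": {"status": 0},
    "response": {
        "numFound": 1,
        "start": 0,
        "docs": [{"id": "https://a.example/", "url": "https://a.example/",
                  "title": "Home"}],
    },
    "highlighting": {"https://a.example/": {"body": ["a <em>simple</em> search"]}},
}
print(simplify_solr_response(example))
```

Because callers only ever see the simplified shape, the search backend could be changed later without breaking anything built on the API.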
I chose Flask because it seems simple (it calls itself a micro framework, and I generally like technologies with micro in the description) and is widely used. Plus I’ve been doing a fair bit with Python recently.
User interface layer
I chose Bootstrap for the CSS framework, because it is good and popular.
I also decided to keep the home page entirely static, while the other pages are dynamic. The theory is that this should keep the home page loading nice and fast. It means there’s a little extra maintenance, with any updates to the dynamic template having to be manually applied to the “hard coded” home page, but I can live with that for now and perhaps automate later.
I also keep all the static assets on the site, so there isn’t anything downloaded from any other domains. This is to try and address privacy concerns, by eliminating the possibilities of any third parties tracking the site usage. I know that will potentially adversely impact the performance, so I may review in future.
It’s all pretty plain Bootstrap at the moment because I really wanted to focus on the functionality at first.
On the UI layer, one thing I did spend a small amount of time on, because I thought it would be fun if nothing else, was a custom icon to flag the results that contained adverts. I used GIMP to create an icon reminiscent of the Parental Advisory warning.
Unfortunately, in order to remain readable, it has to be fairly large, and I was concerned it would be too in-your-face, so I ended up using something more discreet instead. At the time of writing there aren’t any pages in the production version which contain any adverts, so it isn’t visible yet.
Other information
Development and production environments
I settled on splitting the service into 4 Docker containers matching the 4 main components (search, indexer, database and web).
They’re managed via Docker Compose locally, so the entire development environment can be started up with one command, which is nice. The theory is that, by splitting them out into their own environments now, it should be possible to scale more easily later.
I decided to deploy to AWS, again because I’m familiar with it already. Now that I’m having to look very closely at costs, though, it seems many of the services could become very expensive pretty quickly. I did sign up for AWS Activate Founders, which should give a decent amount of credits for the first couple of years. Even so, I’ve remained very cost conscious, and ended up putting everything on one EC2 instance and doing things the “hard way” rather than using the “value add” services.
I started out with the t2.micro, which is on the free tier, for initial setup and testing. I then moved to t3.small for testing with other people, and will move to a t3.medium before any events which might generate a lot of traffic, e.g. a Show HN.
Capacity
If my very rough estimates are correct, I should be able to support indexing for around 1000 sites on one t3.medium. That might be the point at which I’d need to look at implementing the listing fee and/or search as a service fee. I say and/or because it would be one or the other at first, but may end up being both - will need to get feedback.
Domain names
I was originally hoping to get “search.me”, which seemed available as a “premium .me domain”. “search me” is colloquial English for a shrug of the shoulders and an “I don’t know”, which is pretty much the opposite of the image that an all-seeing and all-knowing commercial search engine would want to portray, but as a non-commercial search engine aiming to help find fun and interesting things on the internet, the dual meaning could have worked really well. I was even imagining a logo of a stick figure shrugging their shoulders with their palms in the air. Unfortunately it turns out that search.me “is reserved for the future use by the .me registry and the Government of Montenegro”.
I found searchmysite.net was available, so registered it. searchmysite.com was also for sale from a squatter for not-too-crazy a price, and I have to say I was a little tempted, but as soon as I registered searchmysite.net they whacked up the price dramatically, so now I don’t want to buy it on principle.
Still not convinced it is the best name, so open to suggestions. I can’t think of a search engine that actually has “search” in its name. Even in the early days it was names like “Infoseek” and “AltaVista”.
How long it all took
I’d love to be able to say I built the Minimum Viable Product (MVP) in the space of one weekend, but unfortunately building a simple but scalable system like this took a little longer. In terms of duration, I started trying a few things out at the end of May, did the initial commit on 13 June, and had the first site submissions on 17 July. In terms of effort, I had roughly 2 hours each night after the children were in bed, the odd bit over weekends (although I try to keep weekends for family time), and I also took a week off work (22 June to 26 June for the record) somewhat optimistically hoping to finish the bulk of the initial version then.
The first few submissions and changes
Based on the first 6 submissions, there were quite a few changes I made, the most important of which were:
- Allow filtering out of certain URL paths, configurable on a per domain basis. The first URLs that needed filtering were lists of search queries, e.g. /search/?query=&filter=Technology. I was going to simply not index any pages with a query string, but then I realised that this would miss a lot of content from some other sites which used links like /stream?type=article for navigation, so I had to make the exclude paths configurable.
- Allow filtering out of certain page types. It turns out that many sites have 1000s or 10s of 1000s of microblog posts, which means much or all of the indexing page limit could be taken up by pages which sometimes consisted of as little as one word. While I think there is a case for indexing this content, it would be a very different solution, e.g. with content pushed for near real-time search, an alternative interface, etc. For a traditional style search engine I think the more longform and more long lived content is more relevant. So I implemented a solution which detects page type when indexing (currently using article data-post-type=) and allows users to prevent specified page types from being indexed. Unlike the URL path filtering, this requires the indexer to open and parse the page, so it could potentially have an impact on indexing times, but that is something to monitor. I did implement a solution for users to define custom page types using xpath expressions, but I removed this because it started making what should be a simple solution look quite complicated. I could always add that back if there is demand, because it could be a very powerful feature, e.g. combined with the query filters in the API.
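The two filters above can be sketched with the Python standard library alone. This is only an illustration: the function and class names are hypothetical, the exclude paths are treated as simple prefixes (an assumption, since the matching rule is not described), and the real indexer would use Scrapy's own selectors rather than `html.parser`.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


def is_excluded(url: str, exclude_paths: list) -> bool:
    """True if the URL's path (plus query string, if any) starts with an excluded prefix."""
    parts = urlsplit(url)
    target = parts.path + ("?" + parts.query if parts.query else "")
    return any(target.startswith(prefix) for prefix in exclude_paths)


class PageTypeParser(HTMLParser):
    """Records data-post-type from the first <article> tag that carries one."""

    def __init__(self):
        super().__init__()
        self.page_type = None

    def handle_starttag(self, tag, attrs):
        if tag == "article" and self.page_type is None:
            self.page_type = dict(attrs).get("data-post-type")


def detect_page_type(html: str):
    """Return the page type, or None if the page does not declare one."""
    parser = PageTypeParser()
    parser.feed(html)
    return parser.page_type


# Per-domain configuration, hypothetical values.
exclude_paths = ["/search/"]
exclude_types = {"note"}

print(is_excluded("https://example.com/search/?query=&filter=Technology", exclude_paths))
print(is_excluded("https://example.com/stream?type=article", exclude_paths))
page_type = detect_page_type('<article data-post-type="note"><p>Hi</p></article>')
print(page_type, page_type in exclude_types)
```

The path check needs only the URL, so it can run before a page is fetched, whereas the page type check can only run once the page has been downloaded and parsed.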
Total costs
Ignoring the cost of my time, upfront costs have been nearly zero, and running costs should be manageable short term.
Domain registration was £13.68. The SSL certs were free through LetsEncrypt, and the initial hosting was on the t2.micro free tier. Assuming I can keep it to one t3.medium EC2 instance, and not use any of the other paid services, the cost for the server should be under £300 a year, and as per above may even be free for the first two years under the AWS Activate Founders scheme.
If I start getting a lot of traffic and have to upgrade, there is the plan to cover the running costs.
Next steps & the future
Well, now it’s out there, the next step is to see if there is any interest. And if there is, whether it is more for the search for independent sites, or for the search as a service. Or even for a different direction beyond those, e.g. as a community curated search. Needless to say, there are a lot of potential improvements that could be made.
The good things about this approach are that it can:
- Start small and grow organically. A whole-internet search isn’t really something you can build in your garage any more, as evidenced by all the new whole-internet search engines which are simply buying their data from the incumbents [2]. Not that I’d want to index the whole internet these days anyway given the amount of spam on it. Best case, in the very long term, would be if all the good sites were indexed in this search.
- Be sustainable if successful. There’s no point in having a great service that everyone loves if it can’t be sustained. So many of the internet giants have burned through vast amounts of money with no idea how they can become profitable, and so simply fall back on advertising. It might never be able to generate equivalent revenue to advertising funded sites, but this is hopefully part of a new era of “$1 million dollars isn’t cool. You know what’s cool? Making the internet a better place”.
[1] e.g. The Return of the 90s Web, Rediscovering the Small Web, If I could bring one thing back to the internet it would be blogs, Ask HN: Is there a search engine which excludes the world’s biggest websites?, Mozilla goes incubator with ‘Fix The Internet’ startup early-stage investments.
[2] e.g. a search startup recently getting $37.5m in funding to buy their search results from another search engine: One company’s plan to build a search engine ….