This semester I had a handful of exams rather late in the year, weeks after classes ended. Now, the Internets are a pretty interesting place, and when I am in front of the screen I find it extremely difficult not to glance at the feed reader and all the other gizmos rather often.
I saw three possible solutions:
- a) Work analog. (Pretty unrealistic, and think of the poor trees…)
- b) Force yourself to not get distracted. (Sounds as likely as most New Year’s resolutions.)
- c) Turn off only the distractions, or at least minimize them.
Naturally, I went for C. It gave me an excuse to go write the scripts for it. The first attempt consisted of a list of domains added to /etc/hosts on my main machine. It worked, but blocking by domain name alone often gives inconsistent results with respect to subdomains, among other problems.
I have split this tutorial into two parts to make it more readable. The first part deals with collecting domain names, the second with filtering them at the push of a button on an OpenWrt router.
Many programs that deal with RSS can export their subscriptions to OPML (just an XML file), and manipulating these files with XmlStarlet works great. Let’s try it on Miro.
We have a list of outline elements; some are groups, some are feeds. The following extracts the URL from every outline element.
xmlstarlet sel -t -m //outline -v @xmlUrl -n ~/miro_subscriptions.opml
Appending | grep . to the command gets rid of the empty lines, which come from outline elements (such as groups) that have no xmlUrl attribute. Just repeat the procedure for the other XML variants your programs produce.
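If you would rather stay in Python, the same extraction can be sketched with the standard library. This is only a rough equivalent of the XmlStarlet call, not part of the original scripts; the function name feed_urls and the sample OPML are made up for illustration.

```python
import xml.etree.ElementTree as ET


def feed_urls(opml_text):
    """Return the xmlUrl attribute of every outline element that has one."""
    root = ET.fromstring(opml_text)
    # iter() also walks nested outlines, so feeds inside groups are found.
    # Outlines without xmlUrl (groups) are skipped, like "| grep ." does.
    return [o.get("xmlUrl") for o in root.iter("outline") if o.get("xmlUrl")]


# Hypothetical sample data, only here to make the sketch runnable.
SAMPLE = """<opml version="2.0">
  <body>
    <outline text="Group">
      <outline text="Feed A" xmlUrl="http://example.com/a.rss"/>
    </outline>
    <outline text="Feed B" xmlUrl="http://example.org/b.xml"/>
  </body>
</opml>"""

if __name__ == "__main__":
    for url in feed_urls(SAMPLE):
        print(url)
```

For a real export you would read the file contents first and pass them to feed_urls.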
Without proper XML
Sometimes, even when an XMLish notation like HTML is used, you get files that are not well-formed XML. Firefox bookmarks are one example (other browsers do it, too). Tidy is an easy way to fix those problems automatically. The following will produce a cleaned-up bookmarks file.
tidy -asxml -o cleanbookmarks.html bookmarks.html
With this you can run it through XmlStarlet without problems. I adapted the following somewhat cryptic line from the XmlStarlet User Guide. Additionally,
| grep -v "place:" removes the lines from Firefox which aren’t URLs.
xmlstarlet sel --html -T -t -m "//*[local-name()='a']" -v @href -n -n a.xml
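As an alternative sketch, Python's html.parser tolerates sloppy markup, so the links could probably be collected without the Tidy step. This is not how the original pipeline works; the class and function names and the sample bookmarks are invented for illustration. Note that it skips hrefs that start with "place:", which is slightly stricter than the grep above.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags, skipping Firefox "place:" entries."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and not href.startswith("place:"):
            self.links.append(href)


def bookmark_urls(html_text):
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Hypothetical excerpt in the style of an exported bookmarks file.
SAMPLE = """<DL><p>
<DT><A HREF="http://example.com/">Example</A>
<DT><A HREF="place:sort=8">Most Visited</A>
<DT><A HREF="https://www.example.org/page">Other</A>
</DL>"""

if __name__ == "__main__":
    print("\n".join(bookmark_urls(SAMPLE)))
```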
From the previous steps you now have a list of URLs. With these, you could already jump ahead to part 2, since the proxy in this project can also filter on URLs or patterns.
If, however, you want to broaden it to domain names, there is one more step. The following Python script takes two arguments, a file with URLs and a filename to write the domains to. Download it directly. Example usage:
./getdomains.py urllist domainlist
If you quickly look over the script, you might notice the four lines below. Since many sites use the www prefix inconsistently, the script checks every domain for it. If the prefix is present, the domain is added to the list both with and without www. This does not cover any other subdomain.
domainshort=domain[4:]
o.write("%s \n" % domain)
if domain[:3] == "www":
    o.write("%s \n" % domainshort)
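To make the idea concrete, here is a minimal sketch of how the URL-to-domain step could work. It is not the downloadable script: the function names are hypothetical, it checks for "www." including the dot, and it drops duplicates, which the excerpt above does not show.

```python
from urllib.parse import urlsplit


def domains_for(url):
    """Return the host of a URL, plus the www-less variant if it has one."""
    host = urlsplit(url).hostname  # None if the line is not a full URL
    if not host:
        return []
    result = [host]
    if host.startswith("www."):
        result.append(host[4:])  # strip the four characters of "www."
    return result


def collect_domains(urls):
    """Flatten domains_for() over many URLs, keeping first-seen order."""
    seen = []
    for url in urls:
        for domain in domains_for(url.strip()):
            if domain not in seen:
                seen.append(domain)
    return seen


if __name__ == "__main__":
    for d in collect_domains(["http://www.example.com/feed",
                              "https://blog.example.org/rss"]):
        print(d)
```

As in the original, a host like blog.example.org is kept as-is, so other subdomains are still not covered.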
If you don’t have OpenWrt, you can also download a version with two tiny changes that make the output usable in /etc/hosts. Just append the entries to the end of that file.
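For reference, a hosts entry is an IP address followed by a hostname, so the conversion might look roughly like the sketch below. I am assuming here that blocked domains are pointed at 127.0.0.1; the function name is hypothetical and the changes in the downloadable version may differ.

```python
def hosts_lines(domains, address="127.0.0.1"):
    """Turn a list of domains into lines suitable for /etc/hosts.

    The address is an assumption: 127.0.0.1 sends requests for the
    listed names to the local machine.
    """
    return ["%s %s" % (address, domain) for domain in domains]


if __name__ == "__main__":
    print("\n".join(hosts_lines(["www.example.com", "example.com"])))
```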
In the next post, we’ll see how to switch the use of this list on and off with a button, rather than meddling with configuration files each time.