I recently read Thijs Feryn's post on the maturity of PHP, and he makes a convincing argument for actually using what modern PHP provides, now that 5.5 is stable.
Drupal 7 has seen many improvements over the past year with regard to 5.4 compatibility, and as it turns out, it now runs pretty smoothly, though many modules still emit E_NOTICEs due to legacy code. The breakage 5.4 caused in many places when Drupal 7 came out (and let's not even talk about a Drupal 6 site) is not comparable to upgrading from 5.4 to 5.5, which is mostly harmless.
Managing PHP versions
You have basically two options for upgrading PHP on your server: find some pre-compiled packages or build from source. The latter should be avoided by nearly everyone, unless you're willing to diligently subscribe to a release mailing list and compile again and again.
On CentOS/RHEL, which is my preferred hosting platform at the moment, one can rely on the excellent IUS repository. It makes switching between 5.3, 5.4 and 5.5 extremely easy. Once you have IUS installed, you can select your packages via the name php5Xu, which means you can switch between 5.3 and 5.5 simply by removing the packages installed from php53u and installing again with php55u. All I needed afterwards to get PHP running again was copying /etc/php-fpm.d/www.conf.rpmsave over www.conf, and 15 minutes later this blog was running on 5.5:
# yum list installed | grep php
php55u-cli.x86_64 5.5.5-2.ius.centos6 @ius
php55u-common.x86_64 5.5.5-2.ius.centos6 @ius
php55u-fpm.x86_64 5.5.5-2.ius.centos6 @ius
php55u-gd.x86_64 5.5.5-2.ius.centos6 @ius
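The switch itself boils down to a few commands; a rough sketch, with package names taken from the list above and paths as on my setup (your FPM pool file may differ):

```shell
# swap out every 5.3 package and install the 5.5 equivalents
yum remove 'php53u*'
yum install php55u-cli php55u-common php55u-fpm php55u-gd

# restore the FPM pool config that was saved aside during removal
cp /etc/php-fpm.d/www.conf.rpmsave /etc/php-fpm.d/www.conf
service php-fpm restart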
An additional benefit is that IUS keeps its packages in sync with the latest upstream releases, which is very helpful in determining what you are actually running. For example, your server might report that you are running 5.3.3 (as stock CentOS 6.4 does) even though critical fixes have been backported by Red Hat, Fedora, et al. from later versions; with IUS you'll get a proper 5.3.27.
Caveat: Of course, don't try this on production if you don't know what you're doing.
Bruce Sterling has struck again. At the Webstock '13 conference he gave a talk (as he is prone to do) in which he looked at the web and society in 2013 and produced a fascinating narrative around it.
The first important point I took away from his talk is how critically divergent the global online companies we depend on have become from traditional 20th-century corporations as well as from the open web itself. He calls these vertically-integrated, global organizations stacks and identifies key factors that distinguish them from the web, such as a proprietary operating system and a post-Internet, non-jailbreakable stack-device (tablets, phones, e-books, et al.), and much, much more. How does he put it?
The Internet had users, stacks have livestock.
Interestingly, the stacks are highly unstable, he says; he believes they won't outlive the Arpanet in the long run, proves it by cat, and you feel a kind of dark euphoria for what's to come.
In sum, just go watch his talk, I couldn't fit this into 140 characters.
There is an old entry in my backlog of ideas for the blog which went along the lines of "try to get a copy of what data mining marketing firms have on file for you". Nothing ever came of that until I saw Ed Felten's post on one of those companies doing just that, with a spiffy site to boot, and I'm really surprised by it.
A quick primer
Most everyone knows that companies are tracking purchasing habits and customer preferences and correlating them with socio-demographic information to sell you more stuff. The who and what is much less widely known, though.
Most often these collection companies are specialized businesses which sell subscriptions to their databases, which they feed from multiple sources; in the U.S. especially, these include private as well as public sources. At the time I jotted down a few notes on the topic, ChoicePoint was one of them.
Most of these companies do not simply stop at providing marketing data to businesses and government (yes, the U.S. government is not prohibited from using private databases, even if it is not allowed to create such databases on citizens in the first place). They often have a close relationship to data based on or provided by credit scoring agencies, such as Altegrity, TransUnion or ISU, which are listed as direct competitors of ChoicePoint by Hoovers. The differences in their aggregate datasets are probably small; the impact of the results on individuals often is not. ChoicePoint was later acquired by LexisNexis, which primarily focuses on publication databases, and some divisions by Acxiom, which shows the further agglomeration of databases of different types within these data brokers.
If I had been asked a week ago, I would have concluded that such data brokers have no interest in providing a site such as aboutthedata.com. It seems intuitive that such brokers would want to collect as much data as possible on their subjects without giving them reason to share less data with them. Ideally, to not be noticed at all.
This could be an initiative of Acxiom to improve the public perception of data mining firms such as themselves. It could be an attempt to preempt any negative coverage on data brokers due to the continuing public discourse on surveillance et al. Or it could just be an attempt to further improve their data.
Let's try it out
Since I did spend several years in the U.S., I figured they should have at least something on me. I entered my personal data from my last residence and was surprised to find that in the categories shown above they had basically nothing, except for an inferred marital status and income bracket, which matched, but which could have been extrapolated from the address and age I entered to register. Ed Felten was more successful but also apparently underwhelmed by the level of detail shown.
My speculations on why this is so fall into two basic groups:
1. They are only showing a minimal set
It's possible that Acxiom is only providing their basic result data and keeping any other further (and more creepy) analysis results to themselves and their clients, but since that's pure speculation I'm going to ignore this avenue for now.
Also, I have in general become skeptical of the predictive power of big data in shaping or determining customer behavior, which in my opinion is often at best on par with a trained sales representative.
2. I didn't provide enough data
Two of the primary categories provided by Acxiom are basically public databases of vehicle and house ownership; neither applied to me at the time. Also, I generally did not sign up for loyalty cards. It's possible and likely that Acxiom either really does not have much more information on me, or that they are unable to merge incomplete datasets on my person into a consistent record.
The latter case highlights the main problem with aboutthedata.com: I can be pretty certain that I'm not seeing the full list of entries Acxiom has on me, simply because a human operator would have to make a judgment call on whether a fragment refers to the same canonical person or not. They are unlikely to manually merge a significant number of entries, which means there have to be significant numbers of entries in their database which they cannot bring together in this web application by algorithm alone. Since they cannot ask me "is this you as well?" without, in many cases, accidentally disclosing data from someone else, they have to err on the side of caution, more so than is probably necessary for most of their customers.
Thus, I'm still left wondering what the site is supposed to accomplish. Calm me? Improve the paper spam selection? Not sure.
For some documents you have to retain the original in dead tree storage format. For most documents which arrive in the mail, however, a digital copy is just fine and there really isn't any need to retain the paper version, especially if your computer can store millions of them in the space needed for one paper binder.
To archive such documents one could now buy a scanner, maybe even an ADF office appliance which spits out a searchable PDF and all that, and then let the device collect dust for the next few months until another batch is processed. However, you can also achieve nearly the same result with your phone and a Linux desktop.
Step 1: Capture
Firstly, you will need to actually digitize your documents. You can of course use any scanner for this, but a phone can be the perfect device to quickly capture dozens of documents, often vastly faster than a flatbed scanner optimized for photos.
I personally am using the Scanbox to do this but any contraption which can hold your phone or digital camera steady such as a tripod mounted at similar distance to the document should do the trick. Speed is the important factor in my solution, not accurately capturing 6pt legalese in the footer.
Step 2: Format
After capturing, you might first need to rotate your photos to get the pages into portrait mode. Watch out though: the generic image preview in GNOME might rotate on-the-fly from EXIF data without telling you. If you were to OCR those files, you would not get any text from them. You can check whether your files still need to be rotated by opening them in GIMP, which will ask you if you want to rotate. You can bulk-rotate according to the camera setting with:
mogrify -auto-orient *.jpg
If your camera orientation did not match your document orientation, you'll have to rotate by hand; the following rotates 90 degrees clockwise:
mogrify -rotate 90 *.jpg
Step 3: Process
Now you are ready to convert your images to PDFs with text in them. Basically, all you need to do is call tesseract and hocr2pdf; the rest of the script below is only concerned with naming things and cleaning up. Thus the packages tesseract-ocr-eng, imagemagick and exactimage should be all that's needed on Debian-based systems; it worked flawlessly for me on Ubuntu 13.04. Essentially, it's a cruder version of Konrad Voelkel's solution.
for f in *.jpg; do
    filename="${f%.*}"
    tesseract "$f" "$filename" -l eng hocr
    hocr2pdf -i "$filename.jpg" -s -o "$filename.pdf" < "$filename.html"
    rm "$filename.html"
    # I wouldn't do this, but you could remove the source image:
    # rm "$filename.jpg"
done
Et voilà, you have a searchable PDF which you can locate with the desktop search of your choice, for example Recoll.
Remember the Milk, RTM for short, is a popular web-based todo list manager. You can access your tasks in your browser or through an app, and they provide a print template. Exporting the actual data is a bit more cumbersome: they consider iCalendar to be the primary backup mechanism, and the only other thing they provide is an Atom feed whose content is a bunch of <span> tags. Nothing you'd want to parse.
However, they also provide an API, and David Waring has made use of it with his rtm-cli Python script. With very little effort I was able to amend his script with a function that creates a CSV file of all general RTM fields except notes. Since it's based on his ls function, you can use rtm-cli's general filtering. If you wanted to export all unfinished tasks, you would execute the following:
$ python rtm -c csv
It creates a UTF-8 file output.csv in your working directory with a structure like this:
123;"Inbox";"Error reporting broken";;N;;;"http://www.example.com";
125;"Inbox";"Add web service";;N;2013-07-21T00:00:00+02:00;;;
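Reading that export back is straightforward with Python's csv module; a small sketch, with the field order inferred from the sample rows above:

```python
import csv
import io

# Two rows in the exported format, taken from the sample above
sample = (
    '123;"Inbox";"Error reporting broken";;N;;;"http://www.example.com";\n'
    '125;"Inbox";"Add web service";;N;2013-07-21T00:00:00+02:00;;;\n'
)

# semicolon-delimited, double-quoted strings
reader = csv.reader(io.StringIO(sample), delimiter=';', quotechar='"')
tasks = list(reader)
print(tasks[0][2])  # -> Error reporting broken
```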
What is Locust?
Locust is an exciting new framework to do load testing on a site. Test scenarios are written in Python and are easily customizable. I used it to create benchmarks for a performance comparison of several cloud hosting providers.
There are several alternatives to Locust, such as the humble Apache Bench (ab), Siege, Proxysniffer and many others. In my opinion, though, none of those offers the ability to realistically mimic site-specific usage patterns while still being incredibly easy to set up, providing a great interface, and being open source.
It also offers features such as distributed load testing through a master/client setup as well as "ramping up" load testing to determine stability limitations, though I did not make use of those features.
Since test scenarios are basically just Python functions extending Locust's base classes, one first has to outline the tasks a simulated user should perform.
There is nothing wrong with simply specifying a list of URLs (possibly by parsing a sitemap.xml) and then relying on Locust to randomly hit those URLs within predefined, randomized intervals. Based on the example on Locust's front page, the following works for a sitemap generated by Drupal's XML Sitemap module:
import random

from locust import Locust, TaskSet, task
from pyquery import PyQuery

class RandomSitemapWalk(TaskSet):
    def on_start(self):
        # fetch the sitemap once and collect all URLs from its <loc> tags
        r = self.client.get("/sitemap.xml?page=1")
        pq = PyQuery(r.content, parser='html')
        self.sitemap_links = []
        for loc in pq.find('loc'):
            self.sitemap_links.append(PyQuery(loc).text())

    @task
    def load_random_page(self):
        url = random.choice(self.sitemap_links)
        r = self.client.get(url)

class WebsiteUser(Locust):
    task_set = RandomSitemapWalk
    host = "http://www.example.com"
    min_wait = 1 * 1000
    max_wait = 10 * 1000
Now you can start Locust with that, set the number of users you want to be spawned and start testing your application.
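Assuming the scenario lives in a file called locustfile.py (the filename is an assumption), starting it looks roughly like this:

```shell
locust -f locustfile.py --host=http://www.example.com
# then open the web interface at http://localhost:8089
# to set the number of users and the hatch rate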
Those tasks by themselves aren't that special; one could just as easily parse the sitemap with awk/sed, pass it to Siege and get similar results. But as the example on their front page shows, procedures such as logging in are trivial, and from there we can build complex workflows.
Profiling logged in users
Most of the time, anonymous requests aren't really the optimization problem, at least if your node-to-hit ratio isn't so low that Varnish can't already serve those users quickly. So, to simulate logged-in users (as their primary example shows), all that's needed is to POST to the login form. Adapted to Drupal 6, this would be done as follows (the credentials are placeholders):
def on_start(self):
    # name/pass are placeholders; form_id and op are Drupal 6's login form fields
    self.client.post("/user/login", {
        "name": "testuser",
        "pass": "secret",
        "form_id": "user_login",
        "op": "Log in"
    })
From there on, we could either reuse our sitemap and let users randomly hit pages, or take a cue from their second tutorial, "Example with HTML parsing", and start responding to the content actually delivered to a logged-in user. The following would log in, read the front page, index all links, and choose a target at random. By itself that is rather limited compared to the sitemap set, but it should be trivial to extend this to walk through a varying number of link depths:
class LoggedInWalk(TaskSet):
    def on_start(self):
        # placeholder credentials for a Drupal 6 login form
        self.client.post("/user/login", {
            "name": "testuser",
            "pass": "secret",
            "form_id": "user_login",
            "op": "Log in"
        })

    @task
    def follow_random_link(self):
        # assume all users arrive at the index page
        r = self.client.get("/")
        pq = PyQuery(r.content)
        link_elements = pq("a")
        self.urls_on_current_page = []
        for l in link_elements:
            if "href" in l.attrib:
                self.urls_on_current_page.append(l.attrib["href"])
        url = random.choice(self.urls_on_current_page)
        r = self.client.get(url)
You can now download the results of these tests as CSV and analyze in depth on the basis of average page performance or fulfillment rate per percentage range per page, or simply store those results for future comparison.
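Once you have such a CSV, even a few lines of stdlib Python get you to aggregate numbers; a sketch with made-up column names and figures, since the exact export format may differ:

```python
import csv
import io

# Hypothetical excerpt of a Locust results CSV; the column names
# and the numbers are made up for illustration.
results = (
    "Name,# requests,Average response time\n"
    "/,100,85\n"
    "/node/1,50,210\n"
)

rows = list(csv.DictReader(io.StringIO(results)))
total_requests = sum(int(r["# requests"]) for r in rows)
# request-weighted overall average response time across all pages
overall_avg = sum(
    int(r["# requests"]) * int(r["Average response time"]) for r in rows
) / total_requests
print(round(overall_avg, 1))  # -> 126.7
```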
Where to go from here
First of all, I think that defining use cases for a web project at the outset, and a load-testing script along those expected behaviours, is a great step in modelling a platform before launch. Furthermore, it makes it easier to evaluate how additional features might impact a live site and its users before deployment. Writing out such a task list and then using Locust's task priorities would be a great way to more closely emulate realistic user navigation paths.
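To illustrate how weighted task priorities behave, here is a stdlib-only sketch; in Locust itself you would attach weights via @task(weight) decorators, and the task names and weights below are pure assumptions:

```python
import random

# Hypothetical navigation profile: users read the front page far more
# often than they post a comment.
weights = {"read_front_page": 10, "read_node": 5, "post_comment": 1}

def pick_task(weights, rng=random):
    # expand the weight table into a flat bag and pick uniformly,
    # which mirrors how a weighted task list is sampled
    bag = [name for name, w in weights.items() for _ in range(w)]
    return rng.choice(bag)

# over many picks, frequencies approach the weight ratios
counts = {name: 0 for name in weights}
for _ in range(1600):
    counts[pick_task(weights)] += 1
```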