Finding 404s and broken pages in my Apache logs

Sometime earlier this year, I broke the Piwik server-side analytics that I’d been using to count hits to the site. It sat this way for about two months before anybody noticed, which I took as a sign that I didn’t actually need them. I look at them for vanity, nothing more.

Since then, I’ve been using Python to parse my Apache logs, an idea borrowed from Dr. Drang. All I want is a rough view count, and if I work on the raw logs, then I can filter out a lot of noise from things like bots and referrer spam. High-level tools like Piwik and Google Analytics make it much harder to prune your results.

My Apache logs include a list of all the 404 errors: any time that somebody (or something) has found a missing page. This is useful information, because it tells me if I’ve broken something (not unlikely, see above). Although I try to have a helpful 404 page, that’s no substitute for fixing broken pages. So I wrote a script that looks for 404 errors in my Apache logs, and prints the most commonly hit pages – then I can decide whether to fix or ignore them.

The full script is on GitHub, along with some instructions. Below I’ll walk through the part that actually does the hard work.

 1  page_tally = collections.Counter()
 2
 3  for line in sys.stdin:
 4
 5      # Any line that isn't a 404 request is uninteresting.
 6      if '404' not in line:
 7          continue
 8
 9      # Parse the line, and check it really is a 404 request; otherwise,
10      # discard it.  Then get the page the user was trying to reach.
11      hit = PATTERN.match(line).groupdict()
12      if hit['status'] != '404':
13          continue
14      page = hit['request'].split()[1]
15
16      # If it's a 404 that I know I'm not going to fix, discard it.
17      if page in WONTFIX_404S:
18          continue
19
20      # If I fixed the page after this 404 came in, I'm not interested
21      # in hearing about it again.
22      if page in FIXED_404S:
23          time, _ = hit["time"].split()
24          date = datetime.strptime(time, "%d/%b/%Y:%H:%M:%S").date()
25          if date <= FIXED_404S[page]:
26              continue
27
28          # But I definitely want to know about links I thought I'd
29          # fixed but which are still broken.
30          print('!! ' + page)
31          print(line)
32          print('')
33
34      # This is a 404 request that we're interested in; go ahead and
35      # add it to the counter.
36      page_tally[page] += 1
37
38  for page, count in page_tally.most_common(25):
39      print('%5d\t%s' % (count, page))

I’m passing the Apache log in to stdin, and looping over the lines. Each line corresponds to a single hit.

On lines 6–7, I’m throwing away all the lines that don’t contain the string “404”. This might let through a few lines that aren’t 404 results – I’m not too fussed. This is just a cheap heuristic to avoid (relatively) slow parsing of lots of lines that I don’t care about.

On lines 11–14, I actually parse the line. My PATTERN regex for parsing the Apache log format comes from Dr. Drang’s post. Now I can properly filter for 404 results, and discard everything else. The request parameter is usually something like GET /about/ HTTP/1.1 – a method, a page and an HTTP version. I only care about the page, so I throw away the rest.
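The post doesn’t reproduce Dr. Drang’s PATTERN, but a regex for the Apache combined log format looks roughly like this – the group names here are assumptions, chosen to match the hit['status'], hit['request'] and hit['time'] lookups above, and the log line is made up:

```python
import re

# A sketch of a PATTERN for the Apache combined log format.  The real
# regex comes from Dr. Drang's post; the named groups here match the
# lookups in the script above.
PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

line = ('66.249.65.107 - - [17/May/2016:12:03:28 +0000] '
        '"GET /missing-page/ HTTP/1.1" 404 208 "-" "Googlebot"')

hit = PATTERN.match(line).groupdict()
print(hit['status'])                  # '404'
print(hit['request'].split()[1])      # '/missing-page/'
```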

Like any public-facing computer, my server is crawled by bots looking for unpatched versions of WordPress and PHP. They’re looking for login pages where they can brute force credentials or exploit known vulnerabilities. I don’t have PHP or WordPress installed, so they show up as 404 errors in my logs.

Once I’m happy that I’m not vulnerable to whatever they’re trying to exploit, I add those pages to WONTFIX_404S. On lines 17–18, I ignore any errors from those pages.

The point of writing this script is to find, and fix, broken pages. Once I’ve fixed the page, the hits are still in the historical logs, but they’re less interesting. I’d like to know if the page is still broken in future, but I already know that it was broken in the past.

When I fix a page, I add it to FIXED_404S, a dictionary in which the keys are pages, and the values are the date on which I think I fixed it. On lines 22–32, I throw away any broken pages that I’ve acknowledged and fixed, if they came in before the fix. But then I highlight anything that’s still broken, because it means my fix didn’t work.
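The timestamp in an Apache log line includes a timezone offset after a space, which is why the script splits on the space before parsing. With a hypothetical timestamp:

```python
from datetime import datetime

# Apache timestamps look like '17/May/2016:12:03:28 +0000'; the
# offset is split off before parsing the rest with strptime.
timestamp = '17/May/2016:12:03:28 +0000'
time, _ = timestamp.split()
date = datetime.strptime(time, '%d/%b/%Y:%H:%M:%S').date()
print(date)   # 2016-05-17
```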

Any hit that hasn’t been skipped by now is “interesting”. It’s a 404’d page that I don’t want to ignore, and that I haven’t fixed in the past. I add 1 to the tally of broken pages, and carry on.

I’ve been using the Counter class from the Python standard library to store my tally. I could use a regular dictionary, but Counter helps clean up a little boilerplate. In particular, I don’t have to initialise a new key in the tally – it starts at a default of 0 – and at the end of the script, I can use the most_common() method to see the 404’d pages that are hit most often. That helps me prioritise what pages I want to fix.
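To illustrate the boilerplate that Counter saves over a plain dict:

```python
import collections

tally = collections.Counter()
for page in ['/a/', '/b/', '/a/', '/a/']:
    tally[page] += 1      # no KeyError: missing keys default to 0

print(tally.most_common(1))   # [('/a/', 3)]

# The plain-dict equivalent needs an explicit initialisation step:
plain = {}
for page in ['/a/', '/b/', '/a/', '/a/']:
    plain.setdefault(page, 0)
    plain[page] += 1
```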

Here’s a snippet from the output when I first ran the script:

23656   /atom.xml
14161   /robots.txt
 3199   /favicon.ico
 3075   /apple-touch-icon.png
  412   /wp-login.php
  401   /blog/2013/03/pinboard-backups/

Most of the actually broken or missing pages were easy to fix. In ten minutes, I fixed ~90% of the 404 problems that had occurred since I turned on Apache last August.

I don’t know how often I’ll actually run this script. I’ve fixed the most common errors; it’ll be a while before I have enough logs to make it worth doing another round of fixes. But it’s useful to have in my back pocket for a rainy day.


A Python smtplib wrapper for FastMail

Sometimes I want to send email from a Python script on my Mac. Up to now, my approach has been to shell out to osascript, and use AppleScript to invoke Mail.app to compose and send the message. This is sub-optimal on several levels:

  • It relies on Mail.app having up-to-date email config;
  • The compose window of Mail.app briefly pops into view, stealing focus from my main task;
  • Having a Python script shell out to run AppleScript is an ugly hack.

Plus it was a bit buggy and unreliable. Not a great solution.

My needs are fairly basic: I just want to be able to send a message from my email address, with a bit of body text and a subject, and optionally an attachment or two. And I’m only sending messages from one email provider, FastMail.

Since the Python standard library includes smtplib, I decided to give that a try.

After a bit of mucking around, I came up with this wrapper:

 1  from email import encoders
 2  from email.mime.base import MIMEBase
 3  from email.mime.multipart import MIMEMultipart
 4  from email.mime.text import MIMEText
 5  import smtplib
 6
 7  class FastMailSMTP(smtplib.SMTP_SSL):
 8      """A wrapper for handling SMTP connections to FastMail."""
 9
10      def __init__(self, username, password):
11          super().__init__('mail.messagingengine.com', port=465)
12          self.login(username, password)
13
14      def send_message(self, *,
15                       from_addr,
16                       to_addrs,
17                       msg,
18                       subject,
19                       attachments=None):
20          msg_root = MIMEMultipart()
21          msg_root['Subject'] = subject
22          msg_root['From'] = from_addr
23          msg_root['To'] = ', '.join(to_addrs)
24
25          msg_alternative = MIMEMultipart('alternative')
26          msg_root.attach(msg_alternative)
27          msg_alternative.attach(MIMEText(msg))
28
29          if attachments:
30              for attachment in attachments:
31                  prt = MIMEBase('application', "octet-stream")
32                  prt.set_payload(open(attachment, "rb").read())
33                  encoders.encode_base64(prt)
34                  prt.add_header(
35                      'Content-Disposition', 'attachment; filename="%s"'
36                      % attachment.replace('"', ''))
37                  msg_root.attach(prt)
38
39          self.sendmail(from_addr, to_addrs, msg_root.as_string())

Lines 7–12 create a subclass of smtplib.SMTP_SSL, and use the supplied credentials to log into FastMail. Annoyingly, this subclassing is broken on Python 2, because SMTP_SSL is an old-style class, and so super() doesn’t work. I only use Python 3 these days, so that’s okay for me, but you’ll need to change that if you want a backport.

For getting my username/password into the script, I use the keyring module. It gets them from the system keychain, which feels pretty secure. My email credentials are important – I don’t just want to store them in an environment variable or a hard-coded string.

Lines 14–19 define a convenience wrapper for sending a message. The * in the arguments list denotes the end of positional arguments – all the remaining arguments have to be called as keyword arguments. This is a new feature in Python 3, and I really like it, especially for functions with lots of arguments. It helps enforce clarity in the calling code.
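A quick illustration of the bare * in action, with a made-up function:

```python
# Everything after the bare * must be passed by keyword.
def send(*, to_addrs, subject):
    return (to_addrs, subject)

# Keyword arguments are fine:
print(send(to_addrs=['jane@doe.net'], subject='hi'))

# ...but positional arguments raise a TypeError:
try:
    send(['jane@doe.net'], 'hi')
except TypeError:
    print('rejected positional arguments')
```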

In lines 20–23, I’m setting up a MIME message with my email headers. I deliberately use a multi-part MIME message so that I can add attachments later, if I want.

Then I add the body text. With MIME, you can send multiple versions of the body: a plain text and an HTML version, and the recipient’s client can choose which to display. In practice, I almost always use plaintext email, so that’s all I’ve implemented. If you want HTML, see Stack Overflow.
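If you did want an HTML alternative, you’d attach a second MIMEText part to the alternative container. A sketch, separate from the wrapper above:

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

msg_alternative = MIMEMultipart('alternative')

# Attach the plain-text version first, then the HTML version; clients
# that understand multipart/alternative prefer the last part they can
# render.
msg_alternative.attach(MIMEText('Hello world from Python!'))
msg_alternative.attach(
    MIMEText('<p>Hello world from <b>Python</b>!</p>', 'html'))

print('text/html' in msg_alternative.as_string())   # True
```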

Then lines 29–37 add the attachments – if there are any. Note that I use None as the default value for the attachments argument, not an empty list – this is to avoid any gotchas around mutable default arguments.
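The gotcha being avoided: a mutable default is created once, when the function is defined, and shared between every call.

```python
def bad_append(item, items=[]):      # one shared list for every call
    items.append(item)
    return items

print(bad_append('a'))   # ['a']
print(bad_append('b'))   # ['a', 'b'] -- surprise!

def good_append(item, items=None):   # fresh list on each call
    if items is None:
        items = []
    items.append(item)
    return items

print(good_append('a'))  # ['a']
print(good_append('b'))  # ['b']
```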

Finally, on line 39, I call the sendmail method from the SMTP class, which actually dispatches the message into the aether.

The nice thing about subclassing the standard SMTP class is that I can use my wrapper class as a drop-in replacement. Like so:

with FastMailSMTP(user, pw) as server:
    server.send_message(from_addr='hello@example.org',
                        to_addrs=['jane@doe.net', 'john@smith.org'],
                        msg='Hello world from Python!',
                        subject='Sent from smtplib',
                        attachments=['myfile.txt'])

I think this is a cleaner interface to email. Mucking about with MIME messages and SMTP is a necessary evil, but I don’t always care about those details. If I’m writing a script where email support is an orthogonal feature, it’s nice to have them abstracted away.


Safely deleting a file called ‘-rf *’

Odd thing that happened at work today: we accidentally created a file called -rf * on one of our dev boxes. Linux allows almost any character in a filename, with the exception of the null byte and a slash, which lets you create unhelpfully-named files like this. (Although you can’t create -rf /.)

You have to be a bit careful deleting a file like that, because running rm -rf * usually deletes everything in the current directory. You could try quoting it – rm "-rf *", perhaps – but you have to be careful to get the quotes right.

On systems with a GUI, you can just use the graphical file manager, which doesn’t care about shell flags. But most of our boxes are accessed via SSH, and only present a command line.

Another possible fix is to rename the file to something which doesn’t look like a shell command, and delete the renamed file. But trying to do mv "-rf *" has the same quoting issues as before.

In the end, I went with an inelegant but practical solution:

$ python -c 'import shutil; shutil.move("-rf *", "deleteme")'

Python doesn’t know anything about these Unix flags, so it renames the file without complaint.
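The same trick, sketched in a throwaway temporary directory:

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    nasty = os.path.join(tmp, '-rf *')

    # Creating and renaming the file is easy from Python: no shell is
    # involved, so the name is never parsed as flags or a glob.
    open(nasty, 'w').close()
    shutil.move(nasty, os.path.join(tmp, 'deleteme'))
    contents = os.listdir(tmp)

print(contents)   # ['deleteme']
```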

I feel like I should probably know how to quote the filename correctly to delete this without going to Python, but sometimes safety and pragmatism trump elegance. This works, and it got us out of a mildly tricky spot.

Hopefully this is not something many people need to fix, but now it’s all sorted, I can’t help but find the whole thing mildly amusing.


“The document could not be saved”

I try not to make unreasonable complaints about the quality of software. I write software for a living, and writing bug-free software is hard. A trawl through the code I’ve written would reveal many embarrassing or annoying bugs. People in glass houses, etc.

But I do have some basic expectations of the software I use. For example, I expect that my text editor should be able to open, edit, and save files.

A screenshot of a TextMate window showing a dialog "The document 'hello.txt' could not be saved.  Please check Console output for reason."


Two days ago, TextMate on my laptop decided it didn’t fancy saving files any more. Actually, it decided that ⌘+S should do nothing, and trying to click “Save” in the File menu would throw this error dialog. A trawl of Google suggests that I’m the only person who’s ever hit this error. If so, here’s a quick write-up of what I tried for the next person to run into it.



Hiding the YouTube search bar

This morning, I got an email from Sam, asking if I had a way to cover up the persistent YouTube search bar.

Three years ago, I wrote a bookmarklet for cleaning up the worst of the Google Maps interface, and we can adapt this to clean up YouTube as well. Unlike that post, this is one I’m likely to use myself. (Writing the Maps bookmarklet was a fun exercise in JavaScript, but I almost always use Google Maps on my phone, so I was never as annoyed by the clutter on the desktop version.)

If we do “Inspect Element” on a YouTube page, we can find the element that contains this search box: <div id="yt-masthead-container">. So we want to toggle the visibility of this element. Since it’s only one item, we can write a much smaller JavaScript snippet for toggling the visibility:

var search_bar = document.getElementById("yt-masthead-container");

// Check if it's already hidden
var hidden = (window.getComputedStyle(search_bar)).getPropertyValue("display");

// Set the visibility based on the opposite of the current state
void(search_bar.style.display = (hidden == "none" ? "" : "none"));

To use this code, drag this link to your bookmarks bar:

Toggle the YouTube search bar

Simply click it once to make the bar disappear, and click it again to bring it all back.

Something that wasn’t in my original Google Maps bookmarklet is that void() call. It turns out that if a bookmarklet returns a value, it’s supposed to replace the current page with that value. Which strikes me as bizarre, but that’s what Chrome does, so it broke the page. (Safari doesn’t – not sure if that’s a bug or a feature.) The void function prevents that from happening.

This isn’t perfect – content below the bar doesn’t reflow to take up the available space – but the bar no longer hangs over content as you scroll. I think I’ll find this useful when I’m pressed for space on small screens. It’s a bit more screen real-estate I can reclaim. Thanks for the idea, Sam!


Treat regular expressions as code, not magic

Regular expressions (or regexes) have a reputation for being unreadable. They provide a very powerful way to manipulate text, in a very compact syntax, but it can be tricky to work out what they’re doing. If you don’t write them carefully, you can end up with an unmaintainable monstrosity.

Some regexes are just pathological1, but the vast majority are more tractable. What matters is how they’re written. It’s not difficult to write regexes that are easy to read – and that makes them easy to edit, maintain, and test. This post has a few of my tips for making regexes that are more readable.

Here’s a non-trivial regex that we’d like to read:

MYSTERY = r'^v?([0-9]+)(\.([0-9]+)(\.([0-9]+[a-z]))?)?$'

What’s it trying to parse? Let’s break it down.

Tip 1: Split your regex over multiple lines

A common code smell is “clever” one-liners. Lots of things happen on a single line, which makes it easy to get confused and make mistakes. Since disk space is rarely at a premium (at least, not any more), it’s better to break these up across multiple lines, into simpler, more understandable statements.

Regexes are an extreme version of clever one-liners. Splitting a regex over multiple lines can highlight the natural groups, and make it easier to parse. Here’s what our regex looks like, with some newlines and indentation:

MYSTERY = (
    r'^v?'
    r'([0-9]+)'
    r'('
        r'\.([0-9]+)'
        r'('
            r'\.([0-9]+[a-z])'
        r')?'
    r')?$'
)

This is the same string, but broken into small fragments. Each fragment is much simpler than the whole, and you can start to understand what the regex is doing by analysing each fragment individually. And just as whitespace and indentation are helpful in non-regex code, here they help to convey the structure – different groups are indented to different levels.

So now we have some idea of what this regex is matching. But what was it trying to match?

Tip 2: Comment your regexes

Comments are really important for the readability of code. Good comments should explain why the code was written this way – what problem was it trying to solve?

This is helpful for many reasons. It helps us understand what the code is doing, why it might make some non-obvious choices, and helps to spot bugs. If we know what the code was supposed to do, and it does something different, we know there’s a problem. We can’t do that with uncommented code.

Regexes are a form of code, and should be commented as such. I like to have an overall comment that explains the overall purpose of the regex, as well as individual comments for the broken-down parts of the regex. Here’s what I’d write for our example:

# Regex for matching version strings of the form vXX.YY.ZZa, where
# everything except the major version XX is optional, and the final
# letter can be any character a-z.
#
# Examples: 1, v1.0, v1.0.2, v2.0.3a, 4.0.6b
VERSION_REGEX = (
    r'^v?'                          # optional leading v
    r'([0-9]+)'                     # major version number
    r'('
        r'\.([0-9]+)'               # minor version number
        r'('
            r'\.([0-9]+[a-z]?)'     # micro version number, plus
                                    # optional build character
        r')?'
    r')?$'
)

As I was writing these comments, I actually spotted a mistake in my original regex – I’d forgotten the ? for the optional final character.

With these comments, it’s easy to see exactly what the regex is doing. We can see what it’s trying to match, and jump to the part of the regex which matches a particular component. This makes it easier to do small tweaks, because you can go straight to the fragment which controls the existing behaviour.

So now we can read the regex. How do we get information out of it?

Tip 3: Use non-capturing groups.

The parentheses throughout my regex are groups. These are useful for organising and parsing information from a matching string. In this example:

  • The groups for minor and micro version numbers are followed by a ? – the dot and the associated number are both optional. Putting them both in a group, and making them optional together, means that v2 is a valid match, but v2. isn’t.

  • There’s a group for each component of the version string, so I can get them out later. For example, given v2.0.3b, it can tell us that the major version is 2, the minor version is 0, and the micro version is 3b.

In Python, we can look up the value of these groups with the .groups() method, like so:

>>> import re
>>> m = re.match(VERSION_REGEX, "v2.0.3b")
>>> m.groups()
('2', '.0.3b', '0', '.3b', '3b')

Hmm.

We can see the values we want, but there are a couple of extras. We could just code around them, but it would be better if the regex only captured interesting values.

If you start a group with (?:, it becomes a non-capturing group. We can still use it to organise the regex, but the value isn’t saved.

I’ve changed two groups to be non-capturing in our example:

# Regex for matching version strings of the form vXX.YY.ZZa, where
# everything except the major version XX is optional, and the final
# letter can be any character a-z.
#
# Examples: 1, v1.0, v1.0.2, v2.0.3a, 4.0.6b
NON_CAPTURING_VERSION_REGEX = (
    r'^v?'                          # optional leading v
    r'([0-9]+)'                     # major version number
    r'(?:'
        r'\.([0-9]+)'               # minor version number
        r'(?:'
            r'\.([0-9]+[a-z]?)'     # micro version number, plus
                                    # optional build character
        r')?'
    r')?$'
)

Now when we extract the group values, we’ll only get the components that we’re interested in:

>>> m = re.match(NON_CAPTURING_VERSION_REGEX, "v2.0.3b")
>>> m.groups()
('2', '0', '3b')
>>> m.group(2)
'0'

Now we’ve cut out the noise, and we can access the interesting values of the regex. Let’s go one step further.

Tip 4: Always use named capturing groups

What does m.group(2) mean? It’s not very obvious, unless I have the regex that m was matching against. When reading code, it can be difficult to know what the value of a capturing group means.

And suppose I later change the regex, and insert a new capturing group before the end. I now have to renumber anywhere I was getting groups with the old numbering scheme. That’s incredibly fragile.

There’s a reason we use text, not numbers, to name variables in our programs. If a variable has a descriptive name, the code is much easier to read, because we know what the variable “means”. And when we’re writing code, we’re much less likely to get variables confused.

The same logic should apply to regexes.

Many regex parsers now support named capturing groups. You can supply an alternative name for looking up the value of a group. In Python, the syntax is (?P<name>...) – it varies slightly from language to language.

If we add named groups to our expression:

# Regex for matching version strings of the form vXX.YY.ZZa, where
# everything except the major version XX is optional, and the final
# letter can be any character a-z.
#
# Examples: 1, v1.0, v1.0.2, v2.0.3a, 4.0.6b
NAMED_CAPTURING_VERSION_REGEX = (
    r'^v?'                                # optional leading v
    r'(?P<major>[0-9]+)'                  # major version number
    r'(?:'
        r'\.(?P<minor>[0-9]+)'            # minor version number
        r'(?:'
            r'\.(?P<micro>[0-9]+[a-z]?)'  # micro version number, plus
                                          # optional build character
        r')?'
    r')?$'
)

We can now look up the attributes by name, or indeed access the entire collection with the groupdict() method.

>>> m = re.match(NAMED_CAPTURING_VERSION_REGEX, "v2.0.3b")
>>> m.groups()
('2', '0', '3b')
>>> m.group('minor')
'0'
>>> m.groupdict()
{'major': '2', 'micro': '3b', 'minor': '0'}

If I look up a group with m.group('minor'), it’s much clearer what it means. And if the underlying regex ever changes, the lookup is fine as-is. Named capturing groups make our code much more explicit and robust.

Conclusion

The tips I’ve suggested – significant whitespace, comments, using descriptive names – are useful, but they’re hardly revolutionary. These are all hallmarks of good code.

Regexes are often allowed to bypass the usual metrics of code quality. They sit as black boxes in the middle of a codebase, monolithic strings that look complicated and scary. If you treat regexes as code, rather than magic, you end up breaking them down, and making them more readable. The result is always an improvement.

Regexes don’t have to be scary. Just treat them as another piece of code.


  1. Validating email addresses is a problem that you probably shouldn’t try to solve with regexes. Usually you want to know that the user has access to the address, not just that it’s correctly formatted. To check that, you need to actually send them an email – which ensures it’s valid at the same time. ↩︎


Get images from the iTunes/App/Mac App Stores with Alfred

Several weeks ago, Dr. Drang posted a Python script for getting artwork from the iTunes Store. It uses the iTunes API, which is super handy – I’d never even known it existed. I still rip a fair amount of music from CDs, and having artwork from iTunes is nice. (A script is a better approach than “buy one song from the album, get the artwork”, which is what I used to do.)

Thing is, I do most of my web searches through Alfred. I don’t really want to go out to the command-line for this one task. Wouldn’t it be nice if I could get iTunes artwork through Alfred?

Hmm.

Calling a script is a fairly simple Alfred workflow. I created a keyword input for “ipic”, which requires an argument, and then that argument is passed to the “Run Script” action. That action has a single-line Bash script: calling out to Dr. Drang’s script, passing my input as a command-line argument.

This works fine with Dr. Drang’s original script.

Unfortunately, Alfred passes the entire search as a single string. Although the original script has flags for filtering by content type (e.g. album, film, TV show), you can’t use that filtering in Alfred – the script only ever sees a single argument.

So I tweaked the script to add a special case for Alfred. When Alfred calls the script, it passes an undocumented --alfred flag. Although docopt is nominally passing the command-line flags, it doesn’t know about this one. Instead, I intercept the flags before docopt sees them, and rearrange them if I detect the script is being called by Alfred:

if sys.argv[1] == '--alfred':
    media_type, search_term = sys.argv[2].split(' ', 1)
    if media_type in ('ios', 'mac', 'album', 'film', 'tv', 'book', 'narration'):
        sys.argv = [sys.argv[0], '--{0}'.format(media_type), search_term]
    else:
        sys.argv = [sys.argv[0], sys.argv[2]]

By the time docopt is called, the arguments look as if I called the script from the command-line. It never knows the difference.
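To see the rewrite in action, you can simulate the single string Alfred passes (the script filename here is hypothetical):

```python
import sys

# Simulate Alfred invoking the script with its single-string query.
sys.argv = ['itunes_artwork.py', '--alfred', 'album Abbey Road']

if sys.argv[1] == '--alfred':
    media_type, search_term = sys.argv[2].split(' ', 1)
    if media_type in ('ios', 'mac', 'album', 'film', 'tv', 'book', 'narration'):
        sys.argv = [sys.argv[0], '--{0}'.format(media_type), search_term]
    else:
        sys.argv = [sys.argv[0], sys.argv[2]]

print(sys.argv)   # ['itunes_artwork.py', '--album', 'Abbey Road']
```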

This change, along with long-name flags and writing to a tempfile instead of the Desktop, are in my GitHub fork of Dr. Drang’s original script.


Exclusively create a file in Python 3

I’ve been tidying up a lot of old Python code recently, and I keep running into this pattern:

if not os.path.exists('newfile.txt'):
    with open('newfile.txt', 'w') as f:
        f.write('hello world')

The program wants to write some text to this file, but only if nobody’s written to it before – they don’t want to overwrite the existing contents. This approach is very sensible: if we check that the file exists before writing, we can avoid scribbling over a pre-existing file.

But this code is subject to a race condition: if the file pops into existence between the if and the open(), we scribble all over it anyway.

To catch this race condition, Python 3.3 added a new file mode: x for exclusive creation. If you open a file in mode x, the file is created and opened for writing – but only if it doesn’t already exist. Otherwise you get a FileExistsError.

Here’s how I’d rewrite the snippet above:

try:
    with open('newfile.txt', 'x') as f:
        f.write('hello world')
except FileExistsError:
    print('File already exists.  Clean up!')

Using the x mode means you can be sure that you won’t override an existing file. It’s safer than the existence check.
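As I understand it, mode x corresponds to opening the file with the O_CREAT and O_EXCL flags, so you can get the same atomic guarantee from os.open directly. A sketch:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'newfile.txt')

# O_CREAT | O_EXCL is the atomic "create only if absent" operation
# that mode 'x' exposes.
fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL)
with os.fdopen(fd, 'w') as f:
    f.write('hello world')

# A second exclusive create of the same path fails:
try:
    os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL)
    created_again = True
except FileExistsError:
    created_again = False

print(created_again)   # False
```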

I probably won’t use this a lot, but when I do, I’ll appreciate it. This has been my general experience with Python 3: there’s no killer feature that I can’t live without, just a growing pile of small niceties that I miss when I go back to Python 2.


Backup paranoia

By now, you’ve probably read about the KeRanger ransomware. Ransomware is not a new idea, but this is the first time it’s come to the Mac. If it works as described, it’s a nasty piece of work. And if you read the same articles as me, you saw comments like “If you don’t have backups, you deserve what you get.”1

It’s important to keep good backups, but they’re not foolproof. In this case, I’m not sure backups would always save you.

Claud Xiao and Jin Chen, two security researchers, have worked out what the malware does:

After connecting to the C2 server and retrieving an encryption key, the executable will traverse the “/Users” and “/Volumes” directories, encrypt all files under “/Users”, and encrypt all files under “/Volumes” which have certain file extensions.

The “/Volumes” directory is where OS X mounts disks (both external and internal). It includes “Macintosh HD” and any external drives you have mounted. If your backup drives were mounted when the ransomware got to work, they’d be no help at all.

My backup regime has extra steps that I always thought were paranoid, but now I’m not so sure. Here are a few of my suggestions:

  • Only mount your backup drives when you need them.

    If your backup drive is permanently mounted, then it’s always exposed to problems on your computer. There’s a much higher risk of accidental data corruption, malware or random OS bugs. If you only mount the drives when backups are running, it’s much less exposed.

    I have scripts that auto-mount my drives before my nightly backups start, and auto-eject them when they finish. Most of the time, they’re not mounted.

    This gives you extra time. When something goes wrong, you’ve got a chance to spot it and take action – before it propagates to the backups.

  • Keep an offsite backup that’s hard to modify.

    It’s good to have a backup that’s completely isolated, so anything that goes wrong with your computer cannot possibly affect it. Keep a copy of your data on a drive that’s outside the house – it’s safe from your computer, and from environmental problems like theft or a fire. This is an offsite backup.

    I have two offsite backups: an external drive that I keep at the office, and online backups with Crashplan. The latter is particularly nice, because it stores old versions of every file. Even if files do get corrupted or encrypted, I can always roll back to a known good version.

  • When you go travelling, don’t just leave your computer running.

    If you’re at home when something goes wrong, you have options. You can triage. Diagnose. Work out if you’re affected. If needs be, you can pull the plug (literally). That’s much harder if you’re away from the house, perhaps impossible.

    So ask yourself: should I leave my computer on while I’m away? If it’s not doing anything useful, turn it off. And if you have to keep it running, does it need network access?

  • Disconnect your backups when you’re away from home.

    If you do have to leave your computer running while you’re away, you don’t need up-to-date backups – very little is changing. Unmount and unplug your backup drives, so they’re protected from any problems in your absence.

Nothing is watertight – you could do everything above, and just get unlucky. Data loss happens to the best of us.

But what these suggestions get you is extra time when you have problems. When you’re in a rush, you can panic and make mistakes. In a crisis, having time to breathe and think is invaluable.


  1. This was mixed with the idea that BitTorrent is only used for piracy, which means your computer is fair game for malware authors. I’m not interested in that discussion (at least not today). ↩︎


How I use TextExpander to curb my language

I saw a tweet yesterday that I really liked:

I’m on my own personal quest to banish the word “simply” from all instructional content. Saying “simply” doesn’t make it simple.
Keri Maijala (@clamhead) Feb 22 2016 11:09 AM

I’ve been making a concerted effort to cut down on this sort of phrasing as well.

“Easily” was my personal weak spot – when writing instructions, I would say “you can easily do X”. That’s little comfort to a reader who is trying (and failing) to do X. Similar words are “just” and “clearly”.

I have some TextExpander snippets to help train me out of these bad habits. Whenever I type a word like “easily”, it gets replaced with “easily?”. The extra question mark forces me to think – is that word appropriate here?

Often, the answer is yes, so I delete the question mark and carry on typing. But just having that momentary pause is enough. I can no longer get away with slipping it in automatically, without thinking – I have to justify it every time.

(I don’t think this is an original idea, but I can’t remember where I heard it first. Sorry if that was you!)

And this isn’t just for condescending language. You can use this to help reduce any sort of word you want to cut out of your writing. A particularly important one for me is ableist language. I’m very liable to write this without really thinking. Now I have to check myself every time I use it.

This won’t fix your writing overnight. You’ll still write problematic phrases, but it’s a good way to start training yourself out of it. I recommend trying it.

