Clearing disk space on OS X

Over the weekend, I’ve been trying to clear some disk space on my Mac. I’ve been steadily accumulating lots of old and out-of-date files, and I just wanted a bit of a spring clean. Partly to get back the disk space, partly so I didn’t have to worry about important files getting lost in the noise.

Over the course of a few hours, I was able to clean up over half a terabyte of old files. This wasn’t just trawling through the Finder by hand – I had a couple of tools and apps to help me do this – and I thought it would be worth writing down what I used.

Backups: Time Machine, SuperDuper! and CrashPlan

Embarking on an exercise like this without good backups would be foolhardy: what if you get cold feet, or accidentally delete a critical file? Luckily, I already have three layers of backup:

  • Time Machine backups to a local hard drive
  • Overnight SuperDuper! clones to a local hard drive
  • Backups to the CrashPlan cloud

The Time Machine backups go back to last October, the CrashPlan copies further still. I haven’t looked at the vast majority of what I deleted in months (some stuff years), so I don’t think I’ll miss it – but if I change my mind, I’ve got a way out.

For finding the big folders: DaisyDisk

DaisyDisk can analyse a drive or directory, and it presents you with a pie-chart-like diagram showing which folders are taking up the most space. You can drill down into the pie segments to see the contents of each folder. For example, this shows the contents of my home directory:

This is really helpful for making big space savings – it’s easy to see which folders have become bloated, and target my cleaning accordingly. If you want quick gains, this is a great app.

It’s also fast: scanning my entire 1TB boot drive took less than ten seconds.

For finding duplicate files: Gemini

Once I’ve found a large directory, I need to decide what (if anything) I want to delete. Sometimes I can look for big files that I know I don’t want any more, and move them straight to the Trash. But the biggest waste of space on my computer is multiple copies of the same file. Whenever I reorganise my hard drive, files get copied around, and I don’t always clean them up.

Gemini is a tool that can find duplicate files or folders within a given set of directories. For example, running it over a selection of my virtualenvs:

Once it’s found the duplicates, you can send files straight to the Trash from within the app. It has some handy filters for choosing which dupes to drop – oldest, newest, from within a specific directory – so doing so is pretty quick.

This is another fast way to reclaim space: deleting dupes saves space, but doesn’t lose any information.

Gemini isn’t perfect: it gets slow when scanning large directories (100k+ files), and sometimes it misses duplicates. I often had to run it several times before it found all of the dupes in a directory. Note that I’m only running v1: these problems may be fixed in the new version.
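Gemini’s internals aren’t public, but the basic idea of duplicate detection can be sketched in a few lines of Python: hash every file’s contents, and any hash shared by two or more paths is a set of dupes. (This is my own illustration, not how Gemini actually works.)

```python
import hashlib
import os

def find_duplicates(root):
    """Group files under root by the SHA-256 hash of their contents."""
    by_hash = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            h = hashlib.sha256()
            with open(path, 'rb') as f:
                # Read in chunks so large files don't exhaust memory
                for chunk in iter(lambda: f.read(65536), b''):
                    h.update(chunk)
            by_hash.setdefault(h.hexdigest(), []).append(path)
    # Any hash shared by two or more paths is a set of duplicates
    return [paths for paths in by_hash.values() if len(paths) > 1]
```

A real tool would compare file sizes first to skip most of the hashing, but this captures the principle: identical contents, identical hash.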

File-by-file comparisons: Python’s filecmp module

Sometimes I wanted to compare a couple of individual files, not an entire directory. For this, I turned to Python’s filecmp module. This module contains a number of functions for comparing files and directories. This let me write a shell function for doing the comparisons on the command-line (this is the fish shell):

function filecmp
    python -c "import filecmp; print(filecmp.cmp('''$argv[1]''', '''$argv[2]'''))"
end

Fish drops in the two arguments to the function as $argv[1] and $argv[2]. The -c flag tells Python to run a command passed in as a string, and that command prints the result of calling filecmp.cmp() with the two files – True if they match, False if they don’t.

I’m using triple-quoted strings in the Python, so that filenames containing quote characters don’t prematurely terminate the string. I could still be bitten by a filename that contains a triple quote, but that would be very unusual. And unlike Python, where quote characters are interchangeable, it’s important that I use double-quotes for the string in the shell: shells only expand variables inside double-quoted strings, not single-quoted strings.
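To see why the choice of quotes matters, here’s the difference between the two quote styles (bash syntax here, though fish expands variables inside quotes the same way):

```shell
name="world"
echo "hello $name"   # double quotes: the variable is expanded
echo 'hello $name'   # single quotes: printed literally
```

The first prints `hello world`; the second prints `hello $name`, dollar sign and all.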

Usage is as follows:

$ filecmp hello.txt hello.txt
True

$ filecmp hello.txt foo.txt
False

I have this in my fish config file, so it’s available in any Terminal window. If you drag a file from the Finder to the Terminal, it auto-inserts the full path to that file, so it’s really easy to do comparisons – I type filecmp, and then drag in the two files I want to compare.

This is great if I only want to compare a few files at a time. I didn’t use it much on the big clean, but I’m sure it’ll be useful in the future.

Photos library: Duplicate Photos Cleaner

Part of this exercise was trying to consolidate my photo library. I’ve tried a lot of tools for organising my photos – iPhoto, Aperture, Lightroom, a folder hierarchy – and so photos are scattered across my disk. I’ve settled on using iCloud Photo Library for now, but I still had directories with photos that I hadn’t imported.

When I found a directory with new pictures, I just loaded everything into Photos. It was faster than cherry-picking the photos I didn’t already have, and it ensured I didn’t miss anything – but of course, it also meant importing some duplicates.

Once I’d finished importing photos from the far corners of my disk, I was able to use this app to find duplicates in my photo library, and throw them away. It scans your entire Photo Library (it can do iPhoto and Photo Booth as well), and moves any duplicates to a dedicated album, for you to review/delete at will.

I chose the app by searching the Mac App Store; there are plenty of similar apps, and I don’t know how this one compares. I can’t particularly recommend it over the other options, but it found legitimate duplicates, so it was fine for my purposes.

Honourable mentions: find, du and df

There are a couple of other command-line utilities that I also find useful.

If I wanted to find out which directories contain the most files – not necessarily the most space – I could use find. This isn’t about saving disk space, it’s about reducing the sheer number of unsorted files I keep. There were two commands I kept using:

  • Count all the files below the current directory: both files in this directory, and all of its subdirectories.
    $ find . | wc -l
  • Find out which of the subdirectories of the current directory contain the most files.
    $ for l in (ls); if [ -d $l ]; echo (find $l | wc -l)"  $l"; end; end
             627  _output
             262  content
              31  screenshots
               3  talks
              33  theme
              11  util

These two commands let me focus on processing directories that had a lot of files. It’s nice to clear away a large chunk of these unsorted files, so that I don’t have to worry about what they might contain.

And when I’m using Linux, I can mimic the functions of DaisyDisk with df and du. The df (display free space) command lets me see how much space is free on each of my disk partitions:

$ df -h
Filesystem      Size   Used  Avail Capacity   iused     ifree %iused  Mounted on
/dev/disk2     1.0Ti  295Gi  741Gi    29%  77290842 194150688   28%   /
devfs          206Ki  206Ki    0Bi   100%       714         0  100%   /dev
map -hosts       0Bi    0Bi    0Bi   100%         0         0  100%   /net
map auto_home    0Bi    0Bi    0Bi   100%         0         0  100%   /home
/dev/disk3s4   2.7Ti  955Gi  1.8Ti    35% 125215847 241015785   34%

And du (display disk usage) lets me see what’s using up space in a single directory:

$ du -hs *
 24K    experiments
 32K    favicon-a.acorn
 48K    favicon.acorn
 24K    style
 56K    templates
 40K    touch-icon-a.acorn

I far prefer DaisyDisk when I’m on the Mac, but it’s nice to have these tools in my back pocket.

Closing thought

These days, disk space is cheap (and even large SSDs are fairly affordable). So I don’t need to do this: I wasn’t running out of space, and it would be easy to get more if I was. But it’s useful for clearing the noise, and finding old files that have been lost in the bowels of my hard drive.

I do a really big cleanup about once a year, and having these tools always makes me much faster. If you ever need to clear large amounts of disk space, I’d recommend any of them.

A two-pronged iOS release cycle

One noticeable aspect of this year’s WWDC keynote was the lack of any new features focused on the iPad. Federico Viticci has written about this on Mac Stories, where he said:

I wouldn’t be surprised to see Apple move from a monolithic iOS release cycle to two major iOS releases in the span of six months – one focused on foundational changes, interface refinements, performance, and iPhone; the other primarily aimed at iPad users in the Spring.

I think this is a very plausible scenario, and between iOS 9.3 and WWDC, it seems like it might be coming true. Why? Education.

Apple doesn’t release breakdowns, but a big chunk of iPad sales seems to come from the education market. Education runs to a fixed schedule: the academic year starts in the autumn, continues over winter and spring, with a long break in the summer. A lot of work happens in the summer break, which includes lesson plans for the coming year, and deploying new tech.

The traditional iOS release cycle – preview the next big release at WWDC, release in the autumn – isn’t great for education. By the time the release ships, the school year is already underway. That can make it difficult for schools to adopt new features, often forcing them to wait for the next academic year.

If you look at the features introduced in iOS 9.3 – things like Shared iPad, Apple School Manager, or Managed Apple ID – these aren’t things that can be rolled out mid-year. They’re changes at the deployment stage. Once students have the devices, it’s too late. Even smaller things, like changes to iTunes U, can’t be used immediately, because they weren’t available when lesson plans were made over the summer. (And almost no teachers are going to run developer previews.)

This means there’s less urgency to get education-specific iPad features into the autumn release, because it’s often a year before they can be deployed. In a lot of cases, deferring that work for a later release (like with iOS 9.3) doesn’t make much of a difference for schools. And if you do that, it’s not a stretch to defer the rest of the iPad-specific work, and bundle it all into one big release that focuses on the iPad. Still an annual cycle, but six months offset from WWDC.

Moving to this cycle would have other benefits. Splitting the releases gives Apple more flexibility: they can spread the engineering work across the whole year, rather than focusing on one massive release for WWDC. It’s easier to slip an incomplete feature if the next major release is six months away, not twelve. And it’s a big PR item for the spring, a time that’s usually quiet for official Apple announcements.

I don’t know if this is Apple’s strategy; I’m totally guessing. But it seems plausible, and I’ll be interested to see if it pans out into 2017.

A subscription for my toothbrush 

Last weekend, I had my annual dental check-up. Thankfully, everything was fine, and my teeth are in (reasonably) good health.

While I was at the dentist, I got the usual reminder to replace my toothbrush regularly. You’re supposed to replace your toothbrush every three months or so: it keeps the bristles fresh and the cleaning effective. If left to my own devices, I would probably forget to do this.

To help me fix this, I buy my toothbrushes through Amazon’s “Subscribe & Save” program. I have a subscription to my toothbrushes: I tell Amazon what I want to buy, and how often I want it, then they remember when to send me a package. So every six months, I get a new pack of brushes.

Amazon always email you a few days before they send the next subscription order, so there’s a chance to cancel it if it’s no longer relevant. It doesn’t come completely out of the blue. And there’s a place for managing your subscriptions on your account page.

I’m sure I could remember to buy toothbrushes myself if I put my mind to it, but it’s quite nice that Amazon will do it for me. It’s just one less thing for me to think about.

Reading web pages on my Kindle

Of everything I’ve tried, my favourite device for reading is still my e-ink Kindle. Long reading sessions are much more comfortable on a display that isn’t backlit.

It’s easy to get ebooks on my Kindle – I can buy them straight from Amazon. But what about content I read on the web? If I’ve got a long article on my Mac that I’d like to read on my Kindle instead, how do I push it from one to the other?

There’s a Send to Kindle Mac app available, but that can only send documents on disk. I tried it a few times – save web pages to a PDF or HTML file, then send them to my Kindle through the app – but it was awkward, and the quality of the finished files wasn’t always great. A lot of web pages have complex layouts, which didn’t look good on the Kindle screen.

But I knew the folks at Instapaper had recently opened an API allowing you to use the same parser as they use in Instapaper itself. You give it a web page, and it passes back a nice, cleaned-up version that only includes the article text. Perhaps I could do something useful with that?

I decided to write a Python script that would let me send articles to a Kindle – from any device.


Introduction to property-based testing 

On Tuesday night, I spoke about testing techniques at the Cambridge Python User Group (CamPUG for short). I talked primarily about property-based testing and the Python library Hypothesis, but I also included an introduction to the ideas of stateful testing, and to fuzz testing with american fuzzy lop (afl).

I was expecting traditional property-based testing to be the main attraction, and the stateful and fuzz testing to be nice extras. In fact, interest was pretty evenly divided between the three topics.

I’ve posted the slides and my rough notes. Thanks to everybody who came on Tuesday – I had a really good evening.

Finding 404s and broken pages in my Apache logs

Sometime earlier this year, I broke the Piwik server-side analytics that I’d been using to count hits to the site. It sat this way for about two months before anybody noticed, which I took as a sign that I didn’t actually need them. I look at them for vanity, nothing more.

Since then, I’ve been using Python to parse my Apache logs, an idea borrowed from Dr. Drang. All I want is a rough view count, and if I work on the raw logs, then I can filter out a lot of noise from things like bots and referrer spam. High-level tools like Piwik and Google Analytics make it much harder to prune your results.

My Apache logs include a list of all the 404 errors: any time that somebody (or something) has found a missing page. This is useful information, because it tells me if I’ve broken something (not unlikely, see above). Although I try to have a helpful 404 page, that’s no substitute for fixing broken pages. So I wrote a script that looks for 404 errors in my Apache logs, and prints the most commonly hit pages – then I can decide whether to fix or ignore them.

The full script is on GitHub, along with some instructions. Below I’ll walk through the part that actually does the hard work.

page_tally = collections.Counter()

for line in sys.stdin:

    # Any line that isn't a 404 request is uninteresting.
    if '404' not in line:
        continue

    # Parse the line, and check it really is a 404 request; otherwise,
    # discard it.  Then get the page the user was trying to reach.
    hit = PATTERN.match(line).groupdict()
    if hit['status'] != '404':
        continue
    page = hit['request'].split()[1]

    # If it's a 404 that I know I'm not going to fix, discard it.
    if page in WONTFIX_404S:
        continue

    # If I fixed the page after this 404 came in, I'm not interested
    # in hearing about it again.
    if page in FIXED_404S:
        time, _ = hit["time"].split()
        date = datetime.strptime(time, "%d/%b/%Y:%H:%M:%S").date()
        if date <= FIXED_404S[page]:
            continue

        # But I definitely want to know about links I thought I'd
        # fixed but which are still broken.
        print('!! ' + page)

    # This is a 404 request that we're interested in; go ahead and
    # add it to the counter.
    page_tally[page] += 1

for page, count in page_tally.most_common(25):
    print('%5d\t%s' % (count, page))

I’m passing the Apache log in to stdin, and looping over the lines. Each line corresponds to a single hit.

On lines 6–7, I’m throwing away all the lines that don’t contain the string “404”. This might let through a few lines that aren’t 404 results – I’m not too fussed. This is just a cheap heuristic to avoid (relatively) slow parsing of lots of lines that I don’t care about.

On lines 11–14, I actually parse the line. My PATTERN regex for parsing the Apache log format comes from Dr. Drang’s post. Now I can properly filter for 404 results, and discard the rest. The request parameter is usually something like GET /about/ HTTP/1.1 – a method, a page and an HTTP version. I only care about the page, so I throw away the rest.
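As a concrete example of that split (the page here is just a sample value):

```python
# A typical value of hit['request'] from a parsed log line
request = 'GET /about/ HTTP/1.1'

# split() gives [method, page, HTTP version]; index 1 is the page
method, page, version = request.split()
print(page)  # /about/
```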

Like any public-facing computer, my server is crawled by bots looking for unpatched versions of WordPress and PHP. They’re looking for login pages where they can brute force credentials or exploit known vulnerabilities. I don’t have PHP or WordPress installed, so they show up as 404 errors in my logs.

Once I’m happy that I’m not vulnerable to whatever they’re trying to exploit, I add those pages to WONTFIX_404S. On lines 17–18, I ignore any errors from those pages.

The point of writing this script is to find, and fix, broken pages. Once I’ve fixed the page, the hits are still in the historical logs, but they’re less interesting. I’d like to know if the page is still broken in future, but I already know that it was broken in the past.

When I fix a page, I add it to FIXED_404S, a dictionary in which the keys are pages, and the values are the date on which I think I fixed it. On lines 22–32, I throw away any broken pages that I’ve acknowledged and fixed, if they came in before the fix. But then I highlight anything that’s still broken, because it means my fix didn’t work.

Any hit that hasn’t been skipped by now is “interesting”. It’s a 404’d page that I don’t want to ignore, and that I haven’t fixed in the past. I add 1 to the tally of broken pages, and carry on.

I’ve been using the Counter class from the Python standard library to store my tally. I could use a regular dictionary, but Counter helps clean up a little boilerplate. In particular, I don’t have to initialise a new key in the tally – it starts at a default of 0 – and at the end of the script, I can use the most_common() method to see the 404’d pages that are hit most often. That helps me prioritise what pages I want to fix.
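As a quick illustration of the boilerplate Counter saves, with some made-up page names:

```python
import collections

hits = ['/atom.xml', '/robots.txt', '/atom.xml']

# With a plain dict, every key has to be initialised before incrementing.
tally = {}
for page in hits:
    tally[page] = tally.get(page, 0) + 1

# Counter does the same with less fuss, and adds most_common() for free.
counter = collections.Counter(hits)
print(counter.most_common(1))  # [('/atom.xml', 2)]
```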

Here’s a snippet from the output when I first ran the script:

23656   /atom.xml
14161   /robots.txt
 3199   /favicon.ico
 3075   /apple-touch-icon.png
  412   /wp-login.php
  401   /blog/2013/03/pinboard-backups/

Most of the actually broken or missing pages were easy to fix. In ten minutes, I fixed ~90% of the 404 problems that had occurred since I turned on Apache last August.

I don’t know how often I’ll actually run this script. I’ve fixed the most common errors; it’ll be a while before I have enough logs to make it worth doing another round of fixes. But it’s useful to have in my back pocket for a rainy day.

A Python smtplib wrapper for FastMail

Sometimes I want to send email from a Python script on my Mac. Up to now, my approach has been to shell out to osascript, and use AppleScript to tell my email client to compose and send the message. This is sub-optimal on several levels:

  • It relies on having up-to-date email config;
  • A compose window briefly pops into view, stealing focus from my main task;
  • Having a Python script shell out to run AppleScript is an ugly hack.

Plus it was a bit buggy and unreliable. Not a great solution.

My needs are fairly basic: I just want to be able to send a message from my email address, with a bit of body text and a subject, and optionally an attachment or two. And I’m only sending messages from one email provider, FastMail.

Since the Python standard library includes smtplib, I decided to give that a try.

After a bit of mucking around, I came up with this wrapper:

from email import encoders
from email.mime.base import MIMEBase
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
import smtplib

class FastMailSMTP(smtplib.SMTP_SSL):
    """A wrapper for handling SMTP connections to FastMail."""

    def __init__(self, username, password):
        super().__init__('mail.messagingengine.com', port=465)
        self.login(username, password)

    def send_message(self, *,
                     from_addr,
                     to_addrs,
                     msg,
                     subject,
                     attachments=None):
        msg_root = MIMEMultipart()
        msg_root['Subject'] = subject
        msg_root['From'] = from_addr
        msg_root['To'] = ', '.join(to_addrs)

        msg_alternative = MIMEMultipart('alternative')
        msg_root.attach(msg_alternative)
        msg_alternative.attach(MIMEText(msg))

        if attachments:
            for attachment in attachments:
                prt = MIMEBase('application', "octet-stream")
                prt.set_payload(open(attachment, "rb").read())
                encoders.encode_base64(prt)
                prt.add_header(
                    'Content-Disposition', 'attachment; filename="%s"'
                    % attachment.replace('"', ''))
                msg_root.attach(prt)

        self.sendmail(from_addr, to_addrs, msg_root.as_string())

Lines 7–12 create a subclass of smtplib.SMTP_SSL, and use the supplied credentials to log into FastMail. Annoyingly, this subclassing is broken on Python 2, because SMTP_SSL is an old-style class, and so super() doesn’t work. I only use Python 3 these days, so that’s okay for me, but you’ll need to change that if you want a backport.

For getting my username/password into the script, I use the keyring module. It gets them from the system keychain, which feels pretty secure. My email credentials are important – I don’t just want to store them in an environment variable or a hard-coded string.

Lines 14–19 define a convenience wrapper for sending a message. The * in the arguments list denotes the end of the positional arguments – all the remaining arguments have to be passed as keyword arguments. This is a new feature in Python 3, and I really like it, especially for functions with lots of arguments. It helps enforce clarity in the calling code.
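Here’s a minimal illustration of the * marker, separate from the email code (the function and names are made up):

```python
def send(to_addr, *, subject, body=''):
    """subject and body can only be passed by keyword."""
    return (to_addr, subject, body)

send('alice@example.org', subject='hi')   # fine
# send('alice@example.org', 'hi')         # TypeError: too many positional arguments
```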

In lines 20–23, I’m setting up a MIME message with my email headers. I deliberately use a multi-part MIME message so that I can add attachments later, if I want.

Then I add the body text. With MIME, you can send multiple versions of the body: a plain text and an HTML version, and the recipient’s client can choose which to display. In practice, I almost always use plaintext email, so that’s all I’ve implemented. If you want HTML, see Stack Overflow.

Then lines 29–37 add the attachments – if there are any. Note that I use None as the default value for the attachments argument, not an empty list – this is to avoid any gotchas around mutable default arguments.
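That gotcha is worth a quick demonstration: a mutable default is created once, when the function is defined, and shared between every call that relies on it.

```python
def broken(item, items=[]):
    # The default list is created once and shared across calls
    items.append(item)
    return items

def safe(item, items=None):
    # A fresh list is created on each call instead
    if items is None:
        items = []
    items.append(item)
    return items

print(broken('a'))  # ['a']
print(broken('b'))  # ['a', 'b'] -- the "default" has grown!
print(safe('a'))    # ['a']
print(safe('b'))    # ['b']
```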

Finally, on line 39, I call the sendmail method from the SMTP class, which actually dispatches the message into the aether.

The nice thing about subclassing the standard SMTP class is that I can use my wrapper class as a drop-in replacement. Like so:

with FastMailSMTP(user, pw) as server:
    server.send_message(from_addr='',
                        to_addrs=['', ''],
                        msg='Hello world from Python!',
                        subject='Sent from smtplib')

I think this is a cleaner interface to email. Mucking about with MIME messages and SMTP is a necessary evil, but I don’t always care about those details. If I’m writing a script where email support is an orthogonal feature, it’s nice to have them abstracted away.

Safely deleting a file called ‘-rf *’

Odd thing that happened at work today: we accidentally created a file called -rf * on one of our dev boxes. Linux allows almost any character in a filename, with the exception of the null byte and a slash, which lets you create unhelpfully-named files like this. (Although you can’t create -rf /.)

You have to be a bit careful deleting a file like that, because running rm -rf * usually deletes everything in the current directory. You could try quoting it – rm "-rf *", perhaps – but you have to be careful to get the quotes right.

On systems with a GUI, you can just use the graphical file manager, which doesn’t care about shell flags. But most of our boxes are accessed via SSH, and only present a command line.

Another possible fix is to rename the file to something which doesn’t look like a shell command, and delete the renamed file. But trying to do mv "-rf *" has the same quoting issues as before.

In the end, I went with an inelegant but practical solution:

$ python -c 'import shutil; shutil.move("-rf *", "deleteme")'

Python doesn’t know anything about these Unix flags, so it renames the file without complaint.

I feel like I should probably know how to quote the filename correctly to delete this without going to Python, but sometimes safety and pragmatism trump elegance. This works, and it got us out of a mildly tricky spot.
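For the record, there are standard shell idioms for this too: `--` tells rm to stop parsing options, and a `./` prefix stops the name looking like a flag – the quotes then protect the space and asterisk from the shell.

```shell
touch -- '-rf *'     # recreate the awkwardly-named file
rm -- '-rf *'        # "--" marks the end of options

touch ./'-rf *'
rm './-rf *'         # a "./" prefix also works
```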

Hopefully this is not something many people need to fix, but now it’s all sorted, I can’t help but find the whole thing mildly amusing.

“The document could not be saved”

I try not to make unreasonable complaints about the quality of software. I write software for a living, and writing bug-free software is hard. A trawl through the code I’ve written would reveal many embarrassing or annoying bugs. People in glass houses, etc.

But I do have some basic expectations of the software I use. For example, I expect that my text editor should be able to open, edit, and save files.

A screenshot of a TextMate window showing a dialog "The document 'hello.txt' could not be saved.  Please check Console output for reason."

Two days ago, TextMate on my laptop decided it didn’t fancy saving files any more. Actually, it decided that ⌘+S should do nothing, and trying to click “Save” in the File menu would throw this error dialog. A trawl of Google suggests that I’m the only person who’s ever hit this error. If so, here’s a quick write-up of what I tried for the next person to run into it.


Hiding the YouTube search bar

This morning, I got an email from Sam, asking if I had a way to cover up the persistent YouTube search bar:

Three years ago, I wrote a bookmarklet for cleaning up the worst of the Google Maps interface, and we can adapt this to clean up YouTube as well. Unlike that post, this is one I’m likely to use myself. (Writing the Maps bookmarklet was a fun exercise in JavaScript, but I almost always use Google Maps on my phone, so I was never as annoyed by the clutter on the desktop version.)

If we do “Inspect Element” on a YouTube page, we can find the element that contains this search box: <div id="yt-masthead-container">. So we want to toggle the visibility of this element. Since it’s only one item, we can write a much smaller JavaScript snippet for toggling the visibility:

var search_bar = document.getElementById("yt-masthead-container");

// Check if it's already hidden
var hidden = (window.getComputedStyle(search_bar)).getPropertyValue("display");

// Set the visibility based on the opposite of the current state
void( = (hidden == "none" ? "" : "none"));

To use this code, drag this link to your bookmarks bar:

Toggle the YouTube search bar

Simply click it once to make the bar disappear, and click it again to bring it all back.

Something that wasn’t in my original Google Maps bookmarklet is that void() call. It turns out that if a bookmarklet returns a value, it’s supposed to replace the current page with that value. Which strikes me as bizarre, but that’s what Chrome does, so it broke the page. (Safari doesn’t – not sure if that’s a bug or a feature.) The void function prevents that from happening.

This isn’t perfect – content below the bar doesn’t reflow to take up the available space – but the bar no longer hangs over content as you scroll. I think I’ll find this useful when I’m pressed for space on small screens. It’s a bit more screen real-estate I can reclaim. Thanks for the idea, Sam!
