Building a northern lights notifier in Python

Because I'm sick of missing them

Thu June 08, 2023

tech

The origin story

A few months ago, my little brother Kevin and I took a trip to the Upper Peninsula of Michigan. It was his week off for spring break, and we wanted to go somewhere secluded where we'd be able to see stars at night. We drove up on a Friday and got in around 5 or 6 PM.

The BnB host left us a note with some standard information, but tacked on at the end was the following message:

P.S. Last night was the best northern lights I've ever seen. More predicted tonight!

For those of you who aren't great with foreshadowing, I won't leave you guessing: There was no great northern lights show that night. I felt a mix of excitement and frustration; we were so close, but just a day late. I suppose that's much of the allure of the northern lights— you can never guarantee you'll see them, even if the conditions seem great.


This is what I saw the next night. Pretty, for sure, but not a crazy lights show.

Well, allure is cool and all, but I'd rather see the northern lights under slightly lamer circumstances than never see them at all. To that end, I looked into setting up a notifier to text me when they're active. I wrote it in Rust as a first-time Rust project, re-read the code, felt sick, and re-wrote it in Python. It's posted on GitHub, for those who are interested.

Background on the code

I've spent a good amount of time searching the internet for a reliable northern lights predictor. After using the University of Alaska Fairbanks site— which is always way too optimistic, in my experience— I've settled on NOAA's forecast as the only reliable source. As far as I can tell, only NOAA's outlook is informed by actual satellite data. Other sites rely on recent trends and some voodoo about the rotation of the Sun.

Great, so we've got our source. Instinctively, I opened the dev tools in my browser and checked out the network tab. I figured maybe NOAA was hitting some API for forecast data that's sent back as JSON. Wishful thinking. The NOAA site doesn't display any of the forecast data as text; instead, it's shown as a running gif covering the last 24 hours. It follows that the bytes requested by the site would be the images that make up the gif. Here's what I saw in the network tab:

[
  {
    "url": "/images/animations/ovation/north/aurora_N_2023-06-10_0330.jpg", 
    "time_tag": "2023-06-10T03:30:00Z"
  }, 
  {
    "url": "/images/animations/ovation/north/aurora_N_2023-06-10_0335.jpg", 
    "time_tag": "2023-06-10T03:35:00Z"
  }
]

Turns out the NOAA site first requests data— in JSON format— about the URLs and timestamps of the forecast images for the last 24 hours. It then loads all of these images in order to play the gif.

My dream of easy, accessible JSON-formatted aurora data was crushed. Still, there had to be a way to scrape together something useful here. After further investigation, I noticed that each image in the gif includes what appears to be a measure of geomagnetic activity ranging from 5 to 200. There's a new image in the gif every 5 minutes, so pulling that "activity" number from the most recent frame would give a useful estimate of aurora activity. The only hard part would be getting the number from the image...

For this, I turned to Tesseract, an open-source OCR (optical character recognition) engine originally developed at HP and later maintained by Google. For those who prefer English: it's a way to extract text from images. Just what we need.

Putting the parts together, I wrote a Python script that does the following:

  1. Every 5 minutes, pull the latest aurora image from NOAA
  2. Convert the "activity" portion of the image to text
  3. If the text is above some hard-coded threshold, send me a text

For step 2, after some experimentation, I found Tesseract gave better results when I cropped the image to just the "activity" text. This was easy enough to do programmatically, and the same bounding box works for every image. Of course, Tesseract is far from perfect, and sometimes it returns nonsense instead of the correct float value. That's fine— we're checking every 5 minutes, so we can throw out some frames without missing the bigger picture.

A quick code walkthrough

Let's start with the scheduler. This script relies on the wonderful Schedule library. Here's what the main loop looks like:

import time

import schedule

if __name__ == "__main__":
    # Check the forecast every 5 minutes; send a daily heartbeat text
    schedule.every(5).minutes.do(check_aurora)
    schedule.every(1).day.do(send_uptime_text)

    while True:
        schedule.run_pending()
        time.sleep(1)

Pretty straightforward. Every 5 minutes, call the 'check_aurora' function. Every day, send an uptime text so I know things are running correctly. Loop indefinitely.

check_aurora mostly delegates, catching errors that might bubble up (from where we attempt to parse a maybe-float-maybe-nonsense string).
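For concreteness, here's a simplified sketch of that flow. This is my own reconstruction: the constant and helper names are made up, and it ignores the 12-hour throttling described further down.

def check_aurora():
    """
    Fetch the latest frame, read its strength, and maybe send a text.
    """
    image = fetch_latest_image()
    try:
        strength = try_read_aurora(image)
    except ValueError:
        # Tesseract read nonsense from this frame; skip it and wait
        # for the next 5-minute cycle
        return
    if strength >= STRENGTH_THRESHOLD:  # hypothetical constant
        send_text(f"Aurora strength is {strength}!")  # hypothetical helper

There are two other functions that hold most of the interesting logic. We'll start in logical order with fetch_latest_image: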

from io import BytesIO

import requests
from PIL import Image


def fetch_latest_image():
    """
    Returns a PIL Image of the most recent aurora forecast frame
    """
    # The JSON endpoint lists every frame in the 24-hour gif, oldest first
    images = requests.get(IMAGES_URL).json()
    latest_image_url = "{}{}".format(BASE_URL, images[-1]["url"])
    resp = requests.get(latest_image_url)
    return Image.open(BytesIO(resp.content))

IMAGES_URL is a constant hard-coded URL from which we can get the JSON data about the list of images that are stitched together to form the gif on NOAA's site. We hit this service, grab the last/most-recent image in the list, and convert the bytes to an Image, which we'll process in the next step.
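For reference, those constants are just hard-coded strings. Assuming NOAA's SWPC hostname, they'd look something like the following; I'm reconstructing the product path from the network tab above, so treat it as an educated guess rather than gospel.

BASE_URL = "https://services.swpc.noaa.gov"
# Path reconstructed from the network tab above -- double-check it
IMAGES_URL = BASE_URL + "/products/animations/ovation_north_24h.json"

With the image in hand, the OCR step comes next: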

import pytesseract
from PIL import Image, ImageOps


def try_read_aurora(image: Image.Image):
    """
    Args:
        - image: Image -- the PIL image to process
    Returns:
        strength: float -- the strength extracted from the image
    Raises:
        ValueError -- allows ValueError to bubble up if the strength
        extracted from the image is not cast-able to a float.
    """
    # Crop to just the "activity" text; the same bounding box works
    # for every frame
    cropped_image = image.crop((565, 20, 633, 46))
    grayscaled = ImageOps.grayscale(cropped_image)
    strength = pytesseract.image_to_string(grayscaled)
    return try_parse_strength(strength)

Here, we take the image as an argument. Next, we crop it to just the area we care about, then grayscale it. We then use pytesseract to convert that processed image to text. Finally, we return an attempt to parse that text into a proper Python float.
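One piece not shown above is try_parse_strength. A minimal version (my sketch, not necessarily the repo's exact code) just strips whitespace and lets float() raise:

def try_parse_strength(strength: str) -> float:
    """
    Parse Tesseract's raw output into a float. Raises ValueError on
    nonsense output, which the caller treats as "throw out this frame".
    """
    return float(strength.strip())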

This code is the heart of the script. There's also some logic to track the max aurora strength and time at which it occurred over a 12-hour window. This is to ensure I don't get blasted with texts if the aurora strength stays above the threshold all night. Instead, texts are only sent when the strength 1) first crosses the threshold or 2) increases substantially beyond the 12-hour max.
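Here's a rough illustration of that throttling logic; the threshold, multiplier, and names below are my own guesses, not lifted from the repo.

from datetime import datetime, timedelta

STRENGTH_THRESHOLD = 50.0     # hypothetical text-worthy threshold
BUMP_FACTOR = 1.5             # hypothetical "increases substantially" multiplier
WINDOW = timedelta(hours=12)

max_strength = 0.0
max_time = datetime.min


def should_text(strength: float, now: datetime) -> bool:
    """
    Returns True if this reading deserves a text: either the first
    crossing of the threshold, or a big jump past the running max.
    """
    global max_strength, max_time
    # Forget the running max once it ages out of the 12-hour window
    if now - max_time > WINDOW:
        max_strength = 0.0
    first_crossing = max_strength < STRENGTH_THRESHOLD <= strength
    big_jump = strength >= max(max_strength, STRENGTH_THRESHOLD) * BUMP_FACTOR
    if strength > max_strength:
        max_strength, max_time = strength, now
    return first_crossing or big_jump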

Otherwise, the rest of the script consists of helper functions and auxiliary logic.

Running the code yourself

To run this script on your own device (or in the cloud, or whatever), the prerequisites are:

  1. Python 3, plus all libraries in the 'requirements.txt'
  2. Tesseract, plus its dependencies
  3. A Twilio account or a Gmail account, with credentials populated in a .env file §

Then download the repo and run via 'python3 main.py'.
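For item 3, the usual pattern is a .env file loaded at startup. Here's a sketch using python-dotenv; the variable names are placeholders I made up, so check the repo's README for the real ones.

import os

from dotenv import load_dotenv

# Hypothetical .env contents (the names are placeholders):
#   TWILIO_ACCOUNT_SID=ACxxxxxxxx
#   TWILIO_AUTH_TOKEN=...
#   GMAIL_ADDRESS=you@gmail.com
#   GMAIL_APP_PASSWORD=...

load_dotenv()  # reads .env from the working directory into os.environ
account_sid = os.environ.get("TWILIO_ACCOUNT_SID")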

Possible improvements

To be clear, there's a lot one could do to beef up this script.

  • I set an arbitrary text-worthy threshold for what's relevant where I live; this could be more exact and extensible.
  • Time of day and weather could both be taken into account— there's probably no need to even check the forecast if it's 11am (you can't see the northern lights during the day) or totally overcast. Integrating with a weather API and adding some simple rules for when to ping NOAA's forecast would be quick ways to improve this script; see the sketch after this list.
  • A UI + support for multiple text recipients would be cool.
  • Running some proper tests on Tesseract's accuracy, and possibly training it on old images, would reduce the number of "toss out" frames.
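To give a taste of the time-of-day rule mentioned above, something like this would skip daytime checks entirely; the hours and timezone are placeholders, not values from the repo.

from datetime import datetime
from zoneinfo import ZoneInfo

DARK_START_HOUR = 21                    # placeholder: 9 PM
DARK_END_HOUR = 5                       # placeholder: 5 AM
LOCAL_TZ = ZoneInfo("America/Detroit")  # placeholder timezone


def worth_checking() -> bool:
    """
    Skip NOAA checks during daylight, when aurora are invisible anyway.
    """
    hour = datetime.now(LOCAL_TZ).hour
    return hour >= DARK_START_HOUR or hour < DARK_END_HOUR


def check_aurora_if_dark():
    # Swap this in for check_aurora in the scheduler
    if worth_checking():
        check_aurora()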

For anyone who wants to contribute, PRs are welcome! For now, though, I'm happy with this little script. I hope it helps someone out there catch a glimpse of a beautiful lights show!


Update: Having run this script for a while, I noticed that Tesseract would go through stretches of incorrectly classifying up to a dozen or so images. After researching how to improve Tesseract's accuracy— ideally without re-training the model, since that looked like a lot of work— I learned (and confirmed empirically) that grayscaling the image greatly increases the likelihood of a correct classification.

I haven't done much statistical analysis on the accuracy boost, but I did update the codebase to accept a flag that triggers a test run over a certain number of images. This involves fetching the most recent x images, then processing and classifying them first without grayscaling, then with grayscaling. Anecdotally, I've seen stretches of 10 un-optimized images fail in a row, whereas I haven't yet seen a single optimized image misclassified.
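That test run boils down to something like this; it's a sketch, and the actual flag handling and output format in the repo may differ.

from io import BytesIO

import pytesseract
import requests
from PIL import Image, ImageOps


def run_accuracy_test(n: int):
    """
    Classify the n most recent frames with and without grayscaling.
    Uses the same IMAGES_URL/BASE_URL constants as the main script.
    """
    frames = requests.get(IMAGES_URL).json()[-n:]
    for frame in frames:
        resp = requests.get(BASE_URL + frame["url"])
        cropped = Image.open(BytesIO(resp.content)).crop((565, 20, 633, 46))
        raw = pytesseract.image_to_string(cropped).strip()
        gray = pytesseract.image_to_string(ImageOps.grayscale(cropped)).strip()
        print(f"{frame['time_tag']}: raw={raw!r} grayscaled={gray!r}")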

Update: To keep this script running all the time, without relying on my laptop being connected to the internet or paying for cloud hosting, I set it up as a background process on a single-board computer. It's been a fun project and an easy way to dip my toes into working with IoT devices— highly recommend for anyone who's interested.

§ Update: Twilio changed its policy regarding toll-free numbers, so I was going to need a new number. That also meant filling out forms, and honestly, Twilio is kind of annoying; there's no need to pay for it when I can use email instead, so I switched to an SMTP setup. I didn't want to use Gmail's SMTP server, but it was the easiest option, so I added optional GPG encryption (via an env var) for privacy :)
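For the curious, the Gmail SMTP path boils down to roughly this; it's a minimal sketch without the optional GPG layer, and the env var names are my placeholders.

import os
import smtplib
from email.message import EmailMessage


def send_email(subject: str, body: str):
    """
    Send a notification email via Gmail's SMTP server.
    """
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = os.environ["GMAIL_ADDRESS"]
    msg["To"] = os.environ["NOTIFY_ADDRESS"]
    msg.set_content(body)

    # Port 465 is Gmail's implicit-TLS port; Gmail requires an "app
    # password" here rather than your normal account password
    with smtplib.SMTP_SSL("smtp.gmail.com", 465) as server:
        server.login(os.environ["GMAIL_ADDRESS"], os.environ["GMAIL_APP_PASSWORD"])
        server.send_message(msg)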
