I've recently been interested in training a simple neural network to recognize numerical digits so I can better understand how to create neural networks.

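A dataset people often use for digit recognition is the MNIST handwritten digit dataset, which contains 70,000 examples (60,000 for training and 10,000 for testing). However, the data is provided in a binary 'IDX' file format rather than anything you can read directly.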
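
To give a feel for what an IDX file looks like, here is a minimal sketch of reading its header, based on the format description on the MNIST page. The function name `parse_idx_header` and the sample bytes are my own illustration, not part of the repository; the sample just mimics the header of the training image file.

```python
import struct

# IDX type codes as listed in the format description on the MNIST page.
IDX_DTYPES = {0x08: 'unsigned byte', 0x09: 'signed byte', 0x0B: 'short',
              0x0C: 'int', 0x0D: 'float', 0x0E: 'double'}

def parse_idx_header(buf):
    """Return (dtype_name, dims) for the IDX header at the start of buf."""
    # The magic number is four bytes: two zero bytes, a type code, and
    # the number of dimensions.
    zero, dtype_code, ndim = struct.unpack('>HBB', buf[:4])
    if zero != 0:
        raise ValueError('not an IDX file: first two bytes must be zero')
    # Each dimension size is a big-endian unsigned 32-bit integer.
    dims = struct.unpack('>' + 'I' * ndim, buf[4:4 + 4 * ndim])
    return IDX_DTYPES[dtype_code], dims

# Hypothetical header bytes: magic 0x00000803, then 60000 images of 28 x 28.
sample = struct.pack('>IIII', 0x00000803, 60000, 28, 28)
print(parse_idx_header(sample))
```

The pixel or label bytes follow immediately after the header, which is why the conversion script further down only has to skip a fixed number of bytes before reading records.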

Seeing as JSON is the data format lingua franca of our time, I wrote a quick script to convert the data into JSON, which can be found in this repository:

https://github.com/lorenmh/mnist_handwritten_json

There are two scripts in that repository which do the file conversion:

  1. process.sh is a Bash script that fetches the IDX files, decompresses them, calls the convert_to_json.py script, and compresses the resulting JSON files.
  2. convert_to_json.py is a Python script that parses the IDX files, combines the image and label files into a list of Python dictionaries, and dumps that list as a JSON file.

How Do I Download This Data?

To download the dataset, you can fetch it from GitHub with curl:

$ curl -LO https://github.com/lorenmh/mnist_handwritten_json/raw/master/mnist_handwritten_test.json.gz
$ curl -LO https://github.com/lorenmh/mnist_handwritten_json/raw/master/mnist_handwritten_train.json.gz

To decompress the files, call gunzip:

$ gunzip *.gz
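
If you would rather stay in Python, a sketch like the following should load either the gzipped download or the decompressed file. The helper name `load_mnist_json` is hypothetical, and the demo writes a tiny made-up file so the snippet runs without the real download; swap in `mnist_handwritten_test.json.gz` once you have it.

```python
import gzip
import json
import os
import tempfile

def load_mnist_json(path):
    """Load a list of {'image': [...], 'label': n} records from path.

    Handles both the gzipped download and the gunzipped .json file.
    """
    opener = gzip.open if path.endswith('.gz') else open
    with opener(path, 'rt', encoding='utf-8') as f:
        return json.load(f)

# Demo on a tiny made-up file that follows the same record layout.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo.json.gz')
    with gzip.open(demo_path, 'wt', encoding='utf-8') as f:
        json.dump([{'image': [0] * 784, 'label': 7}], f)
    records = load_mnist_json(demo_path)

print(len(records), records[0]['label'], len(records[0]['image']))
```

Note that the real training file is large once parsed, since every pixel becomes a Python integer, so expect this to use a fair amount of memory.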

process.sh

#!/bin/bash
cd "${0%/*}"

TRAIN_IMG=train_img
TRAIN_LBL=train_lbl
TEST_IMG=test_img
TEST_LBL=test_lbl

ODIR=data

mkdir -p "$ODIR"

echo 'Fetching files'
curl http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz -o $ODIR/$TRAIN_IMG.ubyte.gz
curl http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz -o $ODIR/$TRAIN_LBL.ubyte.gz
curl http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz -o $ODIR/$TEST_IMG.ubyte.gz
curl http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz -o $ODIR/$TEST_LBL.ubyte.gz

echo 'Decompressing files'
gunzip "$ODIR"/*

python3 convert_to_json.py

echo 'Compressing JSON files'
gzip -k *.json

convert_to_json.py

#!/usr/bin/env python3
'''
Loren Howard - 1/13/2018
The file format is documented here:
    http://yann.lecun.com/exdb/mnist/
This script converts the files from http://yann.lecun.com/exdb/mnist/
into a simple JSON format.
The output JSON file is an array of objects with two fields:
    image: an array with 784 0-255 pixel values (28*28*1byte image)
    label: the corresponding label for this image
'''

import json
import struct

UNPACK = (('data/test_img.ubyte', 'data/test_lbl.ubyte', 'mnist_handwritten_test.json'),
          ('data/train_img.ubyte', 'data/train_lbl.ubyte', 'mnist_handwritten_train.json'))

IMG_HEADER_FMT = '>IIII'
IMG_HEADER_SZ = 16

LBL_HEADER_FMT = '>II'
LBL_HEADER_SZ = 8

LBL_FMT = 'B'
LBL_SZ = 1

JSON_INDENT = 2

def struct_unpack_file(struct_fmt, struct_sz, f):
    # Yield one unpacked tuple per struct_sz-byte record until end of file.
    while True:
        chunk = f.read(struct_sz)
        if not chunk:
            break
        yield struct.unpack(struct_fmt, chunk)

def unpack(img_fname, lbl_fname, o_fname):
    print('Unpacking %s and %s and outputting as %s' % (img_fname, lbl_fname,
                                                        o_fname))
    with open(img_fname, 'rb') as img_file, open(lbl_fname, 'rb') as lbl_file:
        # Both headers are big-endian unsigned 32-bit integers; the first
        # field is the magic number, which is not needed here.
        img_header = img_file.read(IMG_HEADER_SZ)
        lbl_header = lbl_file.read(LBL_HEADER_SZ)

        _, num_img, num_row, num_col = struct.unpack(IMG_HEADER_FMT,
                                                     img_header)
        _, num_lbl = struct.unpack(LBL_HEADER_FMT, lbl_header)

        if num_img != num_lbl:
            raise ValueError('number of labels != number of images')

        # Each image is num_row * num_col unsigned bytes, one per pixel.
        img_sz = num_row * num_col
        img_fmt = 'B' * img_sz

        img_gen = struct_unpack_file(img_fmt, img_sz, img_file)
        lbl_gen = struct_unpack_file(LBL_FMT, LBL_SZ, lbl_file)

        # Build the list while the input files are still open, since the
        # generators read from them lazily.
        lst = [{'image': img, 'label': lbl}
               for img, [lbl] in zip(img_gen, lbl_gen)]

    with open(o_fname, 'w') as o_file:
        json.dump(lst, o_file, indent=JSON_INDENT)

for args in UNPACK:
    unpack(*args)
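
Once the JSON exists, a quick sanity check is worthwhile before feeding it to a network. The sketch below checks a record against the layout described in the docstring above (784 pixel values in 0-255 plus a label) and reshapes the flat pixel list into rows for display. `check_record` and `to_rows` are hypothetical helpers of mine, and the demo record is made up: a blank image with a vertical stroke.

```python
def check_record(record, num_row=28, num_col=28):
    """Raise ValueError if record does not match the documented layout."""
    image, label = record['image'], record['label']
    if len(image) != num_row * num_col:
        raise ValueError('expected %d pixels, got %d'
                         % (num_row * num_col, len(image)))
    if not all(0 <= p <= 255 for p in image):
        raise ValueError('pixel value outside 0-255')
    if not 0 <= label <= 9:
        raise ValueError('label outside 0-9')

def to_rows(image, num_col=28):
    """Split the flat pixel list into rows of num_col pixels each."""
    return [image[i:i + num_col] for i in range(0, len(image), num_col)]

# Made-up record: a vertical stroke in column 14, rows 4 through 23.
pixels = [0] * 784
for row in range(4, 24):
    pixels[row * 28 + 14] = 255
demo = {'image': pixels, 'label': 1}

check_record(demo)
rows = to_rows(demo['image'])
for r in rows:
    # Crude ASCII rendering: '#' for bright pixels, '.' for dark ones.
    print(''.join('#' if p > 127 else '.' for p in r))
```

With the real data you would pass each element of the loaded list through the same two functions.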