Most introductory TensorFlow tutorials use MNIST or FashionMNIST as the example dataset, but their sample code almost always fetches it over the network with the load_data function. That function is heavily encapsulated: when the network is unreliable, the download can fail repeatedly (the author has run into this more than once, to the point of exasperation), and it also reveals nothing about how the data is actually decoded, which hinders later study and understanding. This article therefore explains how to load and parse a locally downloaded copy of the MNIST or FashionMNIST dataset.

A locally downloaded dataset usually comes in one of two formats: NumPy's compressed .npz format, or the gzip-compressed .gz format. We cover each in turn below, assuming throughout that the reader has already downloaded the dataset; if you are not sure where to get it, a quick web search will turn it up.

  1. Loading an .npz dataset is very simple: just call NumPy's load function.
import numpy as np

# Assume the dataset is saved under './datasets/'
try:
    data = np.load('./datasets/mnist.npz')
    x_train, y_train = data['x_train'], data['y_train']
    x_test, y_test = data['x_test'], data['y_test']

    # Save one sample to a text file and inspect it to get
    # an intuitive feel for the data
    # np.savetxt('test.txt', x_train[0], fmt='%3d', newline='\n\n')

    # Normalize pixel values to [0, 1]
    x_train, x_test = x_train / 255.0, x_test / 255.0
except Exception as e:
    print('%s' %e)
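Before unpacking an .npz archive it can be useful to list what it actually contains. The sketch below builds a tiny synthetic archive with the same four keys as mnist.npz (the file name mini_mnist.npz and the array sizes are made up here), so it runs without the real download:

```python
import numpy as np

# Build a tiny synthetic archive with the same key layout as mnist.npz
# (x_train / y_train / x_test / y_test), so this runs standalone.
x = np.zeros((2, 28, 28), dtype=np.uint8)
y = np.array([3, 7], dtype=np.uint8)
np.savez('mini_mnist.npz', x_train=x, y_train=y, x_test=x, y_test=y)

# An .npz is just a zip of .npy files; .files lists the stored array names
with np.load('mini_mnist.npz') as data:
    for key in data.files:
        print(key, data[key].shape, data[key].dtype)
```

Running the same loop against the real mnist.npz shows the familiar (60000, 28, 28) and (10000, 28, 28) shapes before you commit to unpacking it.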
  2. Loading a .gz dataset
import numpy as np
import os
import gzip

# Define the loading function. data_folder is the directory holding the
# four .gz files:
# 'train-labels-idx1-ubyte.gz', 'train-images-idx3-ubyte.gz',
# 't10k-labels-idx1-ubyte.gz', 't10k-images-idx3-ubyte.gz'

def load_data(data_folder):

  files = [
      'train-labels-idx1-ubyte.gz', 'train-images-idx3-ubyte.gz',
      't10k-labels-idx1-ubyte.gz', 't10k-images-idx3-ubyte.gz'
  ]

  paths = [os.path.join(data_folder, fname) for fname in files]

  # labels: skip the 8-byte IDX header (magic number + item count)
  with gzip.open(paths[0], 'rb') as lbpath:
    y_train = np.frombuffer(lbpath.read(), np.uint8, offset=8)

  # images: skip the 16-byte IDX header (magic number, count, rows, cols)
  with gzip.open(paths[1], 'rb') as imgpath:
    x_train = np.frombuffer(
        imgpath.read(), np.uint8, offset=16).reshape(len(y_train), 28, 28)

  with gzip.open(paths[2], 'rb') as lbpath:
    y_test = np.frombuffer(lbpath.read(), np.uint8, offset=8)

  with gzip.open(paths[3], 'rb') as imgpath:
    x_test = np.frombuffer(
        imgpath.read(), np.uint8, offset=16).reshape(len(y_test), 28, 28)

  return (x_train, y_train), (x_test, y_test)

(train_images, train_labels), (test_images, test_labels) = load_data('./datasets/fashion/')
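The offset=8 and offset=16 arguments above skip the IDX file headers rather than parsing them. As a minimal sketch of what those headers contain (the file name mini-images-idx3-ubyte.gz is invented here; the snippet writes its own synthetic file so it runs without the real dataset):

```python
import gzip
import struct

import numpy as np

# An IDX image header is 16 bytes: a magic number (0x00000803 for 3-D
# uint8 data), then image count, rows, and cols, all big-endian int32.
# Build a tiny synthetic file with that layout so the sketch runs standalone.
images = np.arange(2 * 28 * 28, dtype=np.uint8).reshape(2, 28, 28)
header = struct.pack('>IIII', 0x00000803, 2, 28, 28)
with gzip.open('mini-images-idx3-ubyte.gz', 'wb') as f:
    f.write(header + images.tobytes())

# Read the header fields explicitly instead of using offset=16
with gzip.open('mini-images-idx3-ubyte.gz', 'rb') as f:
    magic, n, rows, cols = struct.unpack('>IIII', f.read(16))
    data = np.frombuffer(f.read(), np.uint8).reshape(n, rows, cols)

print(hex(magic), n, rows, cols)   # header fields
print((data == images).all())      # payload round-trips intact
```

Label files use the same scheme with an 8-byte header (magic 0x00000801 plus item count), which is why the code above uses offset=8 for them.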

With that, both the .npz and the .gz formats can be loaded and decoded easily, and there is no need to download the dataset from the network at the start of every test run, sparing you that unnecessary hassle.
