
TensorFlow training stops with "InvalidArgumentError: Assign requires shapes of both tensors to match." or "Incompatible shapes"


I've been playing around with the Tensorflow Object Detection API, and I kept running into the same kind of error, so I'm leaving a note here.

models/running_pets.md at master · tensorflow/models
All I was doing was taking the pet-breed detection tutorial, swapping out its dataset, and pointing it at the images I wanted to classify.

Symptom

Creating the TFRecords went fine, but training stops partway through with the error below.
The environment is ML Engine on GCP.

The replica master 0 exited with a non-zero status of 1. Termination reason: Error. 
Traceback (most recent call last):
  [...]
    groundtruth_classes_with_background_list))
  File "/root/.local/lib/python2.7/site-packages/object_detection/meta_architectures/faster_rcnn_meta_arch.py", line 1421, in _loss_box_classifier
    batch_reg_targets, weights=batch_reg_weights) / normalizer
  File "/root/.local/lib/python2.7/site-packages/object_detection/core/losses.py", line 71, in __call__
    return self._compute_loss(prediction_tensor, target_tensor, **params)
  File "/root/.local/lib/python2.7/site-packages/object_detection/core/losses.py", line 157, in _compute_loss
    diff = prediction_tensor - target_tensor
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py", line 794, in binary_op_wrapper
    return func(x, y, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 2775, in _sub
    result = _op_def_lib.apply_op("Sub", x=x, y=y, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2327, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1226, in __init__
    self._traceback = _extract_stack()

InvalidArgumentError (see above for traceback): Incompatible shapes: [1,63,4] vs. [1,64,4]
	 [[Node: Loss/BoxClassifierLoss/Loss/sub = Sub[T=DT_FLOAT, _device="/job:master/replica:0/task:0/gpu:0"](Loss/BoxClassifierLoss/Reshape_9, Loss/BoxClassifierLoss/stack_4)]]
	 [[Node: total_loss_1_G1426 = _Recv[client_terminated=false, recv_device="/job:master/replica:0/task:0/cpu:0", send_device="/job:master/replica:0/task:0/gpu:0", send_device_incarnation=-5926190012419481980, tensor_name="edge_6736_total_loss_1", tensor_type=DT_FLOAT, _device="/job:master/replica:0/task:0/cpu:0"]()]]
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 198, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 44, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 194, in main
    worker_job_name, is_chief, FLAGS.train_dir)
  File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 296, in train
    saver=saver)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 776, in train
    master, start_standard_services=False, config=session_config) as sess:
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 960, in managed_session
    self.stop(close_summary_writer=close_summary_writer)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 788, in stop
    stop_grace_period_secs=self._stop_grace_secs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 386, in join
    six.reraise(*self._exc_info_to_raise)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 949, in managed_session
    start_standard_services=start_standard_services)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 706, in prepare_or_wait_for_session
    init_feed_dict=self._init_feed_dict, init_fn=self._init_fn)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 256, in prepare_session
    config=config)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 188, in _restore_checkpoint
    saver.restore(sess, ckpt.model_checkpoint_path)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1428, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 767, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 965, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1015, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1035, in _do_call
    raise type(e)(node_def, op, message)
InvalidArgumentError: Assign requires shapes of both tensors to match. lhs shape= [2048,3] rhs shape= [2048,2]
	 [[Node: save/Assign_820 = Assign[T=DT_FLOAT, _class=["loc:@SecondStageBoxPredictor/ClassPredictor/weights"], use_locking=true, validate_shape=true, _device="/job:ps/replica:0/task:0/cpu:0"](SecondStageBoxPredictor/ClassPredictor/weights, save/RestoreV2_820)]]
	 [[Node: save/restore_all/NoOp_1_S8 = _Recv[client_terminated=false, recv_device="/job:master/replica:0/task:0/gpu:0", send_device="/job:ps/replica:0/task:0/cpu:0", send_device_incarnation=-4427910840146810295, tensor_name="edge_6_save/restore_all/NoOp_1", tensor_type=DT_FLOAT, _device="/job:master/replica:0/task:0/gpu:0"]()]]

Caused by op u'save/Assign_820', defined at:
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 198, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 44, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 194, in main
    worker_job_name, is_chief, FLAGS.train_dir)
  File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 281, in train
    keep_checkpoint_every_n_hours=keep_checkpoint_every_n_hours)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1040, in __init__
    self.build()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1070, in build
    restore_sequentially=self._restore_sequentially)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 675, in build
    restore_sequentially, reshape)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 414, in _AddRestoreOps
    assign_ops.append(saveable.restore(tensors, shapes))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 155, in restore
    self.op.get_shape().is_fully_defined())
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_state_ops.py", line 47, in assign
    use_locking=use_locking, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2327, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1226, in __init__
    self._traceback = _extract_stack()

InvalidArgumentError (see above for traceback): Assign requires shapes of both tensors to match. lhs shape= [2048,3] rhs shape= [2048,2]
	 [[Node: save/Assign_820 = Assign[T=DT_FLOAT, _class=["loc:@SecondStageBoxPredictor/ClassPredictor/weights"], use_locking=true, validate_shape=true, _device="/job:ps/replica:0/task:0/cpu:0"](SecondStageBoxPredictor/ClassPredictor/weights, save/RestoreV2_820)]]
	 [[Node: save/restore_all/NoOp_1_S8 = _Recv[client_terminated=false, recv_device="/job:master/replica:0/task:0/gpu:0", send_device="/job:ps/replica:0/task:0/cpu:0", send_device_incarnation=-4427910840146810295, tensor_name="edge_6_save/restore_all/NoOp_1", tensor_type=DT_FLOAT, _device="/job:master/replica:0/task:0/gpu:0"]()]]

Cause

It seems the images I supplied for training were of inappropriate sizes.

In the training config I had defined the minimum and maximum dimensions:

model {
  faster_rcnn {
    num_classes: 2
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 240
        max_dimension: 1280
      }
    }
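As I understand it (this is my reading of the resizer's behavior, not something I have verified against the implementation), keep_aspect_ratio_resizer scales each image so its shorter side becomes min_dimension, then backs the scale off if the longer side would exceed max_dimension. A minimal sketch of that size calculation; the function name is mine:

```python
def keep_aspect_ratio_size(height, width, min_dimension=240, max_dimension=1280):
    """Sketch of how keep_aspect_ratio_resizer picks its output size.

    Scale so the shorter side reaches min_dimension, but cap the scale
    so the longer side never exceeds max_dimension.
    """
    scale = min_dimension / min(height, width)
    if max(height, width) * scale > max_dimension:
        scale = max_dimension / max(height, width)
    return round(height * scale), round(width * scale)

# A 480x640 image: the shorter side (480) is scaled down to 240.
print(keep_aspect_ratio_size(480, 640))   # (240, 320)
# A very wide 100x2000 image: the max_dimension cap kicks in.
print(keep_aspect_ratio_size(100, 2000))  # (64, 1280)
```

The point is that the aspect ratio is preserved, so the output size varies from image to image; only the shorter/longer sides are bounded.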

I assumed this meant that images that were too small or too large would simply be resized into this range. Did that resizing throw off the bounding-box coordinates I had specified, so that they ended up pointing outside the image?
what the field "keep_aspect_ratio_resizer" means in the .config file? · Issue #1794 · tensorflow/models

As it turns out, no. Apparently every image gets resized to this size before being fed into the CNN.

Paper review: Fast R-CNN & Faster R-CNN

But that technique is from R-CNN; Faster R-CNN apparently handles it with something called an RoI pooling layer instead.
I'm getting confused. I'll look into it properly later.

Workaround

Decide on the size of the input images in advance, and write that same size into the config:

model {
  faster_rcnn {
    num_classes: 2
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 480
        max_dimension: 640
      }
    }

As long as you stored the bounding-box coordinates as ratios when creating the TFRecords, I don't think anything else needs to change.
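For reference, by "as ratios" I mean dividing the pixel coordinates by the image width and height so every value lands in [0, 1], which is what the TFRecord bbox fields (image/object/bbox/xmin and friends) are expected to hold. A minimal sketch; the helper name is mine, not part of the API:

```python
def normalize_box(xmin, ymin, xmax, ymax, img_width, img_height):
    """Convert pixel bounding-box corners to [0, 1] ratios.

    x-coordinates are divided by the image width, y-coordinates by the
    image height, so the box stays valid no matter how the image is
    later resized.
    """
    return (xmin / img_width, ymin / img_height,
            xmax / img_width, ymax / img_height)

# A box from (100, 50) to (300, 150) in a 640x480 image.
print(normalize_box(100, 50, 300, 150, 640, 480))
```

Because the coordinates are relative, the aspect-ratio resizer can scale the image however it likes without invalidating the boxes.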

For now, the error no longer occurs as long as no image falls short of the specified minimum or exceeds the maximum.
It still bothers me that I don't understand the root cause.
