2025.09.22

読了時間:5分

gpt-oss-120bを試してみた

國田圭佑

どうも！soda AIエンジニアの國田です。

前回、OpenAIが公開したオープンソースのチャットAIモデル gpt-oss について、紹介させていただきました。その際は小型版である gpt-oss-20b を中心に紹介と検証をさせていただいたのですが、大型モデルのgpt-oss-120bについては、まだ未検証でした。
今回は改めて、性能を試してみましたので、その結果を共有できればと思います。
オープンソースモデルの中でも、かなり巨大なモデルであるgpt-oss-120b・・・その日本語品質や推論性能はどれほどのものでしょうか？

gpt-ossとは？

まず始めに、gpt-ossについて復習しておきましょう。gpt-ossはOpenAIが公開したオープンソースの言語モデルです。オープンソースの言語モデルと言いますと、小規模言語モデル(SLM; Small Language Model)が挙げられますが、それらはおおよそパラメータ数として、70億くらいのものが標準です。gpt-ossは、少しパラメータ数が多めでパラメータ数200億のgpt-oss:20bと1200億のgpt-oss:1200bが公開されています。

また、gpt-ossは「Reasoning Model（推論モデル）」と呼ばれるチャットAIモデルになります。
本題に入る前に、このReasoning Modelとは何かを説明しておきます。

Reasoning Modelとは？

普段、ChatGPTを使っていて「思考中」とか「考えています」と表示されているのを見たことがありませんか？
これが、"Reasoning Model"の特徴であり、ユーザーの入力を元に、自身で思考を巡らせ、回答をまとめる(="Reasoningを行う")という特徴を持っています。

⚠️ちなみに、便宜上、「思考」という単語を使いましたが、私個人としては「AIが思考している」という表現は必ずしも適切ではないと思っています。生成AIはあくまでも、学習したコーパスから次の単語を予測するパターン予測器であり、確率的に単語を連続生成しているに過ぎないからです。

ChatGPTのようなテキスト生成AIは、「たくさんの本やネットの記事を読んで覚えた知識から、一番それっぽい次の言葉を選ぶ」という特徴を持っているのですが、Reasoning Modelの場合ですと、それを更に細かく分解してステップバイステップで回答をしていくというイメージです。

具体的に確認してみましょう。
入力として、以下のような文章を入れた場合で考えてみます。

トムが8個のりんごを持っています。
そのうち3個を食べ、さらに2個を友達からもらったら、最後にトムが持っているりんごはいくつになるでしょう？

◯ Reasoningを使わないAI(Non-Reasoning Model)の場合
　入力に対して直接的に最終回答を返す。
　回答例：「7」
→ つまり、「8 − 3 + 2」という計算を内部で行っているわけではなく、「こういう問題文のあとには『7』という答えが多そうだ」という確率予測に基づいて出力しています。そのため、計算や論理展開が複雑になると誤答の可能性が高まります。というのは、「こういう問題文のあとには数字で答える可能性が高い」という程度でしかNon-Reasoning Modelは理解していないからです。極端な話、上記例での回答文が「10」や「100」になっても、AIモデル的には大きな違いとして認識されていません。

◯　Reasoning Modelの場合
最終的な答えを出力する前に、途中の思考過程を明示的に展開しながら答えに辿り着きます。
　回答例：「トムは最初に8個のりんごを持っている。3個を食べたので 8 − 3 = 5。さらに2個もらったので 5 + 2 = 7。したがって答えは7。」
→ こうすることで複雑な問題でも段階的に正解に近づきやすくなります。

この違いにより、数理問題の解答や複雑な推論が必要な質問に対しても、Reasoning Modelはより正確で一貫性のある回答を提示できるとされています。例えば算数の文章問題では、途中の計算式を自分で書き出して「まずこう計算して、次にこうするから...答えは○○」と順を追って説明できますし、論理パズルでも、「もしAが成り立つならBも成り立つ。ではCは...」と順を追って結論を導くことができます。

人間のように本当に「考えている」わけではありませんが、説明や解釈を羅列しながら、思考しているかのような振る舞いをして筋道を立てることができるため、複雑な問題に強くなっているとされています。
「Reasoning Model＝高度な推論ができるように作られたAI」という訳です。

では、以上を踏まえた上で、gpt-oss-120bを試してみましょう。
今回は、その思考過程を出力させながら、挙動を確認してみたいと思います。

さぁ、やってみよう

実は、gpt-ossは、推論の深さを切り替えることができます。推論の深さは、「推論部分に使った文章の長さ」で表され、例えば、浅い思考（Low（低））に設定すると、回答前に数十単語程度だけReasoningを行います。一方、深い思考(High（高）)では数百〜数千の単語から成る文章を組み立て、詳細な推論過程を内部で展開しながら回答を作成するようになります。

今回は、gpt-oss-120bの推論能力を測るため、推論の深さを「low」「medium」「high」の三段階で切り替えつつ、いくつかのタスクに挑戦させてみたいと思います。
なお、推論の深さ以外の条件（プロンプト文やその他生成パラメータ）は統一しておきます。

タスク① 状況推理

以下のようなプロンプトを投げてみます。

5人(A、B、C、D、E)が部屋にいました。
AさんはBさんと一緒にテレビでボディビルを観戦しており、Dさんは寝ており、Bさんはサラダチキンを食べていました。また、Eさんはペアトレーニングをしています。

突然、家に電話がかかってきました。
Bさんは電話に出るために部屋を出ました。Cさんは何をしているのでしょうか？

この問題の答えは、「Eさんと一緒にペアトレーニングをしている」になります。
部屋には5人しかおらず、Eさんのペア相手が言及されていないため、必然的に余ったCさんがEさんのペアトレーニングのパートナーということになるからです。

では、答えを確認してみましょう。
(なお、以下、reasoning部分につきましては、長いため、たたんでおります。クリックで展開・圧縮が可能です。言語モデルの最終的な答えは、その下の「----- answer -----」以下になります。)

Low

reasoning (クリックで展開)

We need answer: C is doing nothing? Probably watching? Given others occupied, C likely watching TV? Actually A and B watching TV, B left, so A alone watching TV. C maybe also watching TV or something else unspecified. Probably C is watching TV.

----- answer -----
Cさんは特に何も言及されていないので、部屋の中で残っている人は AさんとCさんだけです。
Aさんはまだテレビでボディビルの映像を見続けているので、Cさんはその隣でテレビを見ている、あるいは何もしていない（ただ座っている）と考えるのが自然です。
したがって、Cさんはテレビを見ている（または特に何もしていない） と答えるのが妥当です。

残念ながら間違っていますね。「Cさんはテレビを見ている（または何もしていない）」と、正しい類推はできていないようです。Reasoningの中身も割と早い段階から「Cさんはテレビを見ているのではないか？」と仮定をし、そのまま回答にまでいってしまっているように見受けられます。
ただ、日本語の性能はすごく高いですね。非常に自然な文章が生成されているように見受けられます。
では、続けて、mediumの結果を確認してみたいと思います。

medium

reasoning (クリックで展開)

We need to answer the puzzle: 5 people A,B,C,D,E in a room. A watches TV with B, D sleeping, B eating salad chicken, E doing pair training. Suddenly phone rings. B leaves the room to answer. What is C doing? Likely nothing mentioned, so maybe "C is doing nothing" or "C is watching TV"? Let's parse: A is watching TV with B, so both A and B are watching TV. B also eating salad chicken? Could be multitasking. E is doing pair training (maybe with someone else? maybe with A?). D sleeping. So C is the only one not assigned. So answer: C is doing nothing / just there. Could be "C is watching TV with A and B"? But B left, so after B leaves, A still watching TV, maybe C joins? The puzzle likely expects "C is watching TV" because A was watching TV with B, but after B leaves, maybe C is now watching with A. But not given. Could be "C is doing nothing" as the only remaining activity. Usually such puzzles answer: "C is watching TV" because A was watching TV with B, but B left, so maybe C is now watching with A? However the question: "Cさんは何をしているのでしょうか？" It might be a trick: Since no info, we cannot know. So answer: "We don't know." Could be "C is doing nothing (idle)". I'd answer that we have no info, so cannot determine.

----- Answer ------
この状況で C さんに関しては何も情報が与えられていません。
- A さんはテレビを見ている（B さんと一緒に）
- D さんは寝ている
- B さんはサラダチキンを食べていて、電話が鳴ったので部屋を出た
- E さんはペアトレーニングをしている

以上の4人の行動は明示されていますが、C さんについては一切言及されていません。
したがって、C さんが何をしているかはこの情報だけからは分かりません。
（例：何もしていない、別のことをしている、あるいは A と同じくテレビを見ている、などはすべて推測にすぎません。）

「情報が与えられていないから答えられない」とのこと。理屈としては間違ってはいませんね。
ただ、正直、私が期待していた答えとは違っています。reasoning部分もそれなりに長文を費やしてはいますが、思考力不足が否めませんね。
では、highの結果はどうなるのでしょうか？

high

reasoning (クリックで展開)

We have a puzzle. The user describes a scenario: 5 people (A, B, C, D, E) in a room. A is watching bodybuilding on TV with B. D is sleeping. B is eating salad chicken. E is doing pair training. Suddenly, the phone rings. B leaves the room to answer the phone. The question: "Cさんは何をしているのでしょうか？" (What is C doing?) We need to infer from the given information. The scenario: A, B, C, D, E are in a room. A is watching TV with B. D is sleeping. B is eating salad chicken. E is doing pair training. Then phone rings, B leaves. So we need to deduce what C is doing. The puzzle likely expects that C is doing nothing, or maybe something else. Let's parse. We have 5 people: A, B, C, D, E. The description: A is watching TV with B. D is sleeping. B is eating salad chicken. E is doing pair training. So we have A and B together watching TV. B also eating salad chicken. So B is multitasking: watching TV and eating. E is doing pair training (maybe with someone else). Pair training implies two people training together. Who is the partner? Could be C? Or maybe A? Or maybe B? But B left to answer phone. So before B left, E was doing pair training. Who is the partner? Could be C. So C is training with E. Or maybe C is also watching TV? But A is watching TV with B. So maybe C is not mentioned. The puzzle likely expects that C is doing nothing (maybe "Cさんは何もしていない" or "Cさんは何もしていない" or "Cさんは何もしていない" or "Cさんは何もしていない" or "Cさんは何もしていない" or "Cさんは何もしていない"). But maybe the puzzle is a trick: The scenario says "AさんはBさんと一緒にテレビでボディビルを観戦しており、Dさんは寝ており、Bさんはサラダチキンを食べていました。また、Eさんはペアトレーニングをしています。" So we have A and B watching TV, D sleeping, B eating, E doing pair training. So we have 5 people: A, B, C, D, E. The description mentions A, B, D, E, but not C. So we need to infer what C is doing. The puzzle might be a logic puzzle: each person is doing a distinct activity, and we need to assign activities to each person. The activities mentioned: watching TV (with B), sleeping, eating salad chicken, pair training. That's four activities. There are five people, so one activity is missing. Possibly "Cさんは何か別のことをしている" like "Cさんは電話に出る" but B left to answer phone. So maybe C is answering phone? But B left to answer phone. So C is not answering. Could be "Cさんは何もしていない" (idle). Or maybe "Cさんは何か別のことをしている" like "Cさんはトイレに行く" or "Cさんは何か別のトレーニングをしている". But we need to deduce from the given info. Let's think about typical logic puzzles: They often give a set of people and a set of activities, and each person does a unique activity. The puzzle might be incomplete, but maybe we can infer that C is doing the remaining activity not mentioned. The activities mentioned: watching TV (A and B), sleeping (D), eating salad chicken (B), pair training (E). So we have: watching TV, sleeping, eating, pair training. That's four distinct activities. But B is doing two activities: watching TV and eating. So maybe each person can do multiple activities. But the puzzle likely expects that C is doing something else, maybe "Cさんは何もしていない" (C is doing nothing). Or maybe "Cさんは電話に出る" but B left to answer phone. So maybe C is the one who answered the phone? But B left to answer, so no. Alternatively, maybe the puzzle is a trick: The scenario says "AさんはBさんと一緒にテレビでボディビルを観戦しており、Dさんは寝ており、Bさんはサラダチキンを食べていました。また、Eさんはペアトレーニングをしています。" So we have A and B watching TV, D sleeping, B eating, E doing pair training. So we have A, B, D, E accounted for. C is not mentioned. So the answer is "Cさんは何もしていません" (C is doing nothing). Or "Cさんは何か別のことをしているが、情報が与えられていないのでわからない". But the puzzle likely expects "Cさんは何もしていない" because it's the only one left. But maybe the puzzle is a riddle: "5人が部屋にいました...Cさんは何をしているのでしょうか?" The answer might be "Cさんは部屋にいるだけです" (C is just there). Or "Cさんは何もしていない". Or "Cさんは何か別のことをしているが、情報が足りないので答えられない". But typical puzzles like this often have a twist: The answer is "Cさんは電話に出た" because B left to answer, but maybe C answered? But the text says "Bさんは電話に出るために部屋を出ました". So B left to answer. So C didn't answer. Maybe the puzzle is about "ペアトレーニング" meaning pair training, which requires two people. So E is doing pair training with someone else. Who is the partner? Could be C. So C is training with E. That would be a plausible answer: C is doing pair training with E. But the text says "Eさんはペアトレーニングをしています". It doesn't specify with whom. It could be with C. So C is training with E. That would fill the missing activity for C. But then B is also doing pair training? No, B is watching TV and eating. So C is the partner for E. That seems plausible. But the puzzle might be a typical "logic puzzle" where each person is doing a distinct activity, and we need to assign them. The given activities: watching TV (A and B), sleeping (D), eating salad chicken (B), pair training (E). But we have 5 people, and we have 4 distinct activities. However, pair training involves two people, so that accounts for two participants: E and someone else. That someone else could be C. So C is doing pair training with E. That would give each person an activity: A watching TV, B watching TV and eating, D sleeping, E pair training, C pair training. So C is training with E. That seems plausible. But the puzzle might be a trick: The answer is "Cさんは何もしていない" because the puzzle is incomplete. But typical puzzles like this often have a hidden clue: "ペアトレーニング" means training in pairs, so there must be a partner. Since we have 5 people, the only one not accounted for is C, so C must be the partner. So C is doing pair training with E. Thus answer: C is doing pair training with E. But we need to consider the scenario: A and B are watching TV together. B is also eating salad chicken. So B is multitasking. D is sleeping. E is doing pair training. The phone rings, B leaves. So after B leaves, A is still watching TV alone? Or maybe A continues watching TV. E is still doing pair training with C. D is sleeping. So C is training with E. So answer: C is doing pair training with E. Alternatively, maybe the puzzle expects "Cさんは何もしていない" because the puzzle is a trick: The only person not mentioned is C, so we cannot know. But typical puzzles want you to infer that C is the partner for E's pair training. Let's examine the phrasing: "Eさんはペアトレーニングをしています。" In Japanese, "ペアトレーニング" means training with a partner. So if E is doing pair training, there must be a partner. The only person not accounted for is C. So C must be the partner. So C is doing pair training with E. Thus answer: C is doing pair training with E. But we need to be careful: The puzzle says "5人(A、B、C、D、E)が部屋にいました。 AさんはBさんと一緒にテレビでボディビルを観戦しており、Dさんは寝ており、Bさんはサラダチキンを食べていました。また、Eさんはペアトレーニングをしています。突然、家に電話がかかってきました。 Bさんは電話に出るために部屋を出ました。Cさんは何をしているのでしょうか？" Thus we have: A is watching TV with B. D is sleeping. B is eating salad chicken. E is doing pair training. Then phone rings, B leaves. So what is C doing? The answer: C is training with E (pair training). So C is doing pair training. Thus answer: C is doing pair training with E. Alternatively, maybe C is also watching TV with A after B leaves? But the phrase "AさんはBさんと一緒にテレビでボディビルを観戦しており" suggests A and B are watching TV together. After B leaves, A might be watching TV alone. But C is not mentioned. So maybe C is also watching TV? But the phrase "AさんはBさんと一緒にテレビでボディビルを観戦しており" implies that A is watching TV with B, not with C. So C is not watching TV. Thus the only plausible answer is that C is the partner for E's pair training. Thus answer: C is doing pair training with E. But we need to consider if there is any other hidden nuance. The puzzle might be a trick about "ペアトレーニング" meaning "pair training" which could be a type of exercise where two people train each other. So E is training with a partner. The only person not accounted for is C. So C is the partner. So C is doing pair training. Thus answer: C is doing pair training with E. Thus we can answer: C is training with E (pair training). Or "CさんはEさんとペアトレーニングをしています". That is the answer. We can also mention that C is not doing anything else, like not sleeping, not watching TV, not eating. So the answer is that C is the partner for E's pair training. Thus answer: C is doing pair training with E. We can also note that after B leaves, A continues watching TV alone, D continues sleeping, E continues pair training with C, and C is training with E. Thus answer: C is doing pair training with E. Thus we can answer in Japanese: "CさんはEさんとペアトレーニングをしています". Or "CさんはEさんのペアトレーニングの相手です". Or "CさんはEさんと一緒にトレーニングをしています". That is the answer. Thus answer: C is doing pair training with E. We can also mention that the puzzle is solved by noticing that "ペアトレーニング" requires a partner, and the only person not assigned an activity is C. Thus answer: C is training with E. Thus final answer: C is doing pair training with E. We can also add a short explanation: Because E is doing pair training, there must be a partner, and the only person not mentioned is C, so C must be the partner. Thus answer: C is doing pair training with E. Thus final.

Answer:
Cさんは Eさんとペアトレーニングをしています。

理由

- 「Eさんはペアトレーニングをしています」とあるので、Eさんにはトレーニングの相手が必要です。

- A、B、D はそれぞれテレビ観戦・食事・睡眠で既に役割が決まっており、残っている人物は C さんだけです。

- したがって、C さんは E さんのペアトレーニングの相手、すなわち一緒にトレーニングをしていると考えられます。

（電話が鳴って B さんが部屋を出た後も、A さんはテレビを見続け、D さんは寝続け、E と C がペアでトレーニングを続けている、という状態になります。）

見事、正解です！きちんと「CがEのペアトレーニングの相手だ」という結論に辿り着いています。
ちなみに、reasoningの中身は、結構冗長に似たような内容を繰り返しているようでした。綿密すぎるくらいシチュエーションを繰り返し繰り返し反芻しながらチェックしているようにも見えますね。

この結果から、推論をより深く設定することでモデルの解答精度が向上していることが分かります。
「low」や「medium」では思考が足りず、厳密な正解にまでは辿り着けませんでしたが、「high」に設定することで、問題設定の全体像をきちんと把握して正しい答えに到達することができました。
ステップを多く踏み、より深く考えることが正確な回答生成のために有効であるようです。

タスク② 引っ掛け問題でのテスト

続けて、gpt-ossに、少し意地悪な問題を解かせたいと思います。

見た目と重さの同じ8つのボールがあります。天秤を使って、重いボールを見つけてください。
どのようにしますか？
但し、天秤を使う回数は可能な限り少なくしてください。

通常ですと、「8つのボールの中に1つだけ軽いボールがある」という問題設定ですが、ここでは、「見た目と重さが全て同じ」としています。そのため、「そもそも重いボールはありません」というのが、このテストの正答になります。
では、それぞれ答えを確認していきましょう

Low

reasoning (クリックで展開)

We need answer in Japanese. Explain classic 8 balls one heavier, find with balance scale minimal weighings. 3 weighings enough. Explain strategy.

----- Answer -----
8 個のボールのうち 1 個だけが他より重いとします。天秤（二本の皿があるはかり）を最小 2 回の測定で必ず重いボールを特定できます。（3 回では冗長で、2 回で十分です。）

1 回目の測定

1. ボールを A-B-C-D と E-F-G-H の 2 グループに分けます。

2. 左皿に A, B, C、右皿に E, F, G を乗せます。

結果の分岐

天秤の結果	次に調べるべきボール
左が下がる（A,B,C が重い）	A, B, C
右が下がる（E,F,G が重い）	E, F, G
釣り合う	D または H

2 回目の測定

(1) 左が下がった場合（A,B,C が候補）

左皿に A、右皿に B を乗せる。

結果	重いボール
左が下がる	A
右が下がる	B
釣り合う	C

(2) 右が下がった場合（E,F,G が候補）

左皿に E、右皿に F を乗せる。

結果	重いボール
左が下がる	E
右が下がる	F
釣り合う	G

(3) 釣り合った場合（D または H が候補）

左皿に D、右皿に H を乗せる。

結果	重いボール
左が下がる	D
右が下がる	H

まとめ

必要な測定回数	理由
2 回	天秤の 1 回の測定で結果は 3 通り。2 回で 3²＝9 通りを区別でき、8 通りを十分に区別可能。
3 回	余裕を持った手順説明にはなるが、実際は不要。

最小回数は 2 回です。上記の手順に従えば必ず最小回数で重いボールを見つけられます。

なんと、引っ掛け問題に回答できていません。よくある「重いボールの当て方」に注目してしまい、回答としては間違っています。また、回答そのものも、天秤の利用回数が2回と言ったり3回と言ったりして、論理的には整合性に欠いているようにも見受けられますね。
続けて、mediumの結果を確認してみたいと思います。

medium

reasoning (クリックで展開)

The user asks in Japanese: "There are 8 balls that look the same and have the same weight. Using a balance scale, find the heavy ball. How do you do it? However, use the balance as few times as possible." Classic problem: 8 balls, one heavier, find it with minimal weighings. With a balance scale, you can find heavy ball among 8 in at most 2 weighings? Let's think: In each weighing, you can get three outcomes (left heavy, right heavy, balanced). So with 2 weighings you can distinguish up to 3^2 = 9 possibilities, so 8 is possible. So answer: 2 weighings. Procedure: First weighing: put 3 vs 3. If balanced, heavy is among remaining 2, then weigh one of them against a normal ball to find heavy. If unbalanced, heavy is in heavier side (3 balls). Then second weighing: weigh 1 vs 1 from those three. If balanced, heavy is the third; if not, heavier side is heavy. Thus 2 weighings. Explain steps. Also could mention alternative: weigh 3 vs 3, then 1 vs 1. Thus answer. Provide explanation in Japanese.

----- Answer -----
答え： 2 回の天秤で必ず見つけられます。 天秤は 1 回の測定で「左が重い」「右が重い」「釣り合う」の 3 通りの結果が得られます。
2 回の測定で得られる組み合わせは3 × 3 = 9通りですから、最大 9 個の候補の中から 1 個を特定できます。
8 個のボールは 9 以下なので、2 回で必ず判定可能です。

手順

1. 第 1 回目の測定

ボール 8 個 → 3 個ずつの 2 グループと、残り 2 個に分けます。

左皿に 3 個、右皿に 3 個を乗せて測ります。

結果	次にすべきこと
釣り合う	重いボールは残りの 2 個のどちらか。
左が重い	重いボールは左皿の 3 個の中。
右が重い	重いボールは右皿の 3 個の中。

2. 第 2 回目の測定

ケース A： 第 1 回で釣り合った（残り 2 個が候補）

残り 2 個のうち 1 個を左皿に、既に軽いと分かっているボールを右皿に置く。

左が重ければその 1 個、釣り合えば残りのもう 1 個が重いボール。

ケース B： 第 1 回で左右いずれかが重い（3 個が候補）

その 3 個のうち 2 個を取り出し、1 個ずつ天秤の左右に置く。

結果	重いボール
釣り合う	残した 1 個
左が重い	左に置いたボール
右が重い	右に置いたボール

まとめ

最小回数： 2 回

手順：

① 3 個 vs 3 個を測り、釣り合えば残り 2 個、重ければ重い側の 3 個に絞る。

② 残った候補を 1 個 vs 1 個で測り、釣り合えば残りの 1 個、傾けば重い側。

この方法なら、天秤の使用回数を最小に抑えて確実に重いボールを見つけられます。

mediumも間違った結果となりました。面白い点はreasoningの時点で"There are 8 balls that look the same and have the same weight."(8個のボールがあり、見た目も重さも同じである)としっかり反芻している点ですね。にもかかわらず、直後に"Classic problem:"と、昔からある課題のように話を進めてしまっています。

最後に、highの結果を確認してみたいと思います。

high

reasoning (クリックで展開)

The user asks in Japanese: "There are 8 balls that look the same and have the same weight. Using a balance scale, find the heavy ball. How do you do it? However, use the balance as few times as possible." Interpretation: There are 8 balls that appear identical, but one is heavier (or maybe heavier than the others). The goal is to find the heavy ball using a balance scale (presumably a two-pan balance) with the minimal number of weighings. The classic problem: find the heavier ball among 8 using a balance scale. Usually, the minimal number of weighings is 2? Let's think: With 8 balls, you can find the heavier ball in at most 2 weighings? Let's see: In each weighing, you can compare groups of balls. The outcome can be left heavier, right heavier, or balanced. That's 3 possible outcomes per weighing. So with n weighings, you can distinguish up to 3^n possibilities. For 8 possibilities (which ball is heavy), we need n such that 3^n >= 8. 3^1 = 3 < 8, 3^2 = 9 >= 8. So 2 weighings are sufficient. So the answer: 2 weighings. Now we need to describe the method: First weighing: weigh 3 balls vs 3 balls. If they balance, the heavy ball is among the remaining 2 balls. Then second weighing: weigh those 2 balls against each other, heavier one is heavy. If the first weighing is unbalanced, the heavy ball is among the heavier side's 3 balls. Then second weighing: weigh 1 ball vs 1 ball from those 3. If they balance, the third ball is heavy; if not, the heavier side is heavy. Thus 2 weighings. Alternatively, you could weigh 3 vs 3, then weigh 1 vs 1 among the suspect group. Thus answer: 2 weighings. But the user might want a step-by-step explanation. Also note that the problem says "見た目と重さの同じ8つのボールがあります" which translates to "There are 8 balls that look the same and have the same weight." Actually "見た目と重さの同じ" means "same appearance and weight"? But then "重いボールを見つけてください" means "Find the heavy ball." So maybe one ball is heavier than the others, but they appear identical. So we need to find the heavy ball. Thus answer: Use 2 weighings. We can also discuss the theoretical lower bound: Since each weighing yields 3 outcomes, we need at least ceil(log_3(8)) = 2 weighings. Thus answer: 2 weighings. We can also mention that if we had more balls, the minimal weighings would be ceil(log_3(N)). But for 8, it's 2. Thus answer: weigh 3 vs 3, then weigh accordingly. We can also propose alternative method: weigh 4 vs 4? That would give heavy side with 4 balls, then weigh 2 vs 2, then weigh 1 vs 1. That would be 3 weighings, not minimal. Thus the minimal is 2 weighings. Thus answer: 2 weighings. Now we need to respond in Japanese, presumably. The user wrote in Japanese. So answer in Japanese. We can also include a diagram or description. Thus final answer: 2 weighings, with steps. We can also mention that if the heavy ball is heavier, not lighter, we can use the same method. Thus answer: 2 weighings. Now we need to produce a final answer. We should also consider if the problem might be a trick: "見た目と重さの同じ8つのボールがあります" could be interpreted as "There are 8 balls that are identical in appearance and weight." But then "重いボールを見つけてください" suggests there is a heavy ball among them. So maybe the heavy ball is heavier than the others, but they appear identical. So we need to find the heavy ball. Thus answer: 2 weighings. We can also mention that if we have a scale that can measure weight difference, we could weigh all at once, but the problem says "天秤" which is a balance scale, not a digital scale. Thus answer: 2 weighings. Now we can produce final answer. We can also include a table of possibilities. Thus final answer: 2 weighings. Now we need to produce final answer in Japanese, with explanation. We can also mention the theoretical lower bound: log_3(8) ~ 1.89, so need at least 2 weighings. Thus answer: 2 weighings. Now we can produce final answer.

----- Answer -----
答え： 2回の天秤で必ず見つけられます。

なぜ 2 回で足りるのか（理論的根拠）

天秤の 1 回の測定では、

・左側が重い

・右側が重い

・両側が釣り合う

の 3 通り の結果が得られます。

n 回測定したときに区別できる場合の数は \(3^n\) です。8 個のうちどれが重いかを判別するには少なくとも \(3^n \ge 8 \;\Longrightarrow\; n \ge \lceil \log_3 8 \rceil = 2\) が必要です。

したがって 2 回で十分、2 回未満では不可能です。

実際の手順（2 回で重いボールを特定）

1 回目の測定：

3 個 と 3 個 を天秤にかける。

結果	次に調べるべきボール
釣り合う	残りの 2 個が候補
左側が重い	左側の 3 個が候補
右側が重い	右側の 3 個が候補

2 回目の測定：

(a) 1 回目が釣り合った場合 → 残っている 2 個を 直接比較（1 個 vs 1 個）。重い方が答え。

(b) 1 回目で重い側が出た場合 → その側の 3 個から 1 個 vs 1 個 を測る。

結果	重いボール
釣り合う	比べなかった残りの 1 個
左側が重い	左皿に乗せた 1 個
右側が重い	右皿に乗せた 1 個

まとめ

① 3 vs 3 を測り、重さが出た側（または釣り合えば残り 2 個）に絞る。

② 絞った 3 個（または 2 個）を 1 vs 1 で測る。

この手順で 天秤はたった 2 回 だけで、8 個の中の重いボールを必ず見つけられます。

（※「軽い」ボールを探す場合でも同様に、軽い側を追っていけば 2 回で判別可能です。）

なんと、highにしても引っ掛け問題であるという点は見過ごしてしまいました。reasoningの中身を確認してみます。

"Also note that the problem says "見た目と重さの同じ8つのボールがあります" which translates to "There are 8 balls that look the same and have the same weight." Actually "見た目と重さの同じ" means "same appearance and weight"? But then "重いボールを見つけてください" means "Find the heavy ball." So maybe one ball is heavier than the others, but they appear identical. "
(「見た目と重さの同じ8つのボールがあります」は「There are 8 balls that look the same and have the same weight.」と訳すことができます。ここにも注目しなくてはなりません。実際に「見た目と重さの同じ」というのは、外見と重量が同じことを意味するのでしょうか？しかし、「重いボールを見つけてください」というのは、「Find the heavy ball.」という意味です。したがって、おそらく1つのボールは他よりも重いのでしょう。)

与えられた問題に対して、勝手に解釈を加えてしまっているように見受けられます。
「問題の前提部分が間違っているのでは？」という仮定を一方的にしてしまい、命題である「重いボールを見つけて」の方を順守してしまったようです。
プロセスとしては納得のいくものですが、「思い込み」によって勝手に暴走したとも言えそうです。
仕事でやっていたら「ちゃんと確認しろ」って怒られるやつですね・・これ。

さて、深い推論モードのreasoningを見ると、モデルは一度正解である「見た目と重さが同じ」という部分に着目をしていながら、自らそれを退けてるような挙動をしてしまっていることがわかります。人間であれば「いや待て、ひょっとして問題設定が間違っているのか？確認した方が良いのではないか？」と気づけるところを、モデルはあえて通常の解釈に戻ってしまったわけです。

これは、LLMの「たくさんの本やネットの記事を読んで覚えた知識から、一番それっぽい次の言葉を選ぶ」という性質を如実に反映しており、訓練データ上で上記のような天秤問題の質問パターンに何度も遭遇していたのではないかと考えられます。それゆえに、その統計的パターンに回答が引きずられてしまい、今回のように「引っ掛け問題」という点に至らなかったのだと考えられます。
推論を深く繰り返したからといって必ずしも正答に至るわけではなく、間違った思い込みをして、もっともらしく説明してしまうケースも存在するということですね。

⚠️今回の挙動は、先行研究でも報告されている現象と一致します。Googleの研究（Malek et al., 2025）やAppleのレポート（Shojaee et al., 2025）では、最新の推論型モデルですら人間には簡単なタスクでつまずいたり、逆に考え過ぎてしまう事例が多数紹介されています。
つまり、gpt-oss-120bが深い推論を展開しながらも誤答してしまったのは特異な例ではなく、LLM全般に見られる性質のひとつだと言えるでしょう。執筆時点では、gpt-5も同様にこの類の引っ掛け問題には正解できませんでした。

実は、これ如実に昨今の言語モデルの特徴を反映しています。
よく「生成AIは人間を超える」とか「新しい発見をできる」と喧伝されることがありますが、実際のテキスト生成モデルが行っているのは、「無から全く新しいものを生み出す」ことではなく、「学習したデータに基づくパターンを組み合わせて、新しいように見える回答を生成する」ことです。
そのため、学習していない領域や分布外のパターンに直面すると性能が低下することが指摘されており、例えば、Huang et al., 2025では、「多くのモデルが訓練分布外では性能が大きく落ちる」と述べられています。

そのような場合、どう対応させるのが適切でしょうか？
実は、メタ認知的なプロンプト設計を行うことで解決できる場合があります。「前提に矛盾はないか？」「問題文にトリックはないか？」といった内容の確認を入れることで、モデルが入力の異常に気づきやすくしてあげます。

例えば、以下のようにプロンプトを設計します。

見た目と重さの同じ8つのボールがあります。天秤を使って、重いボールを見つけてください。
どのようにしますか？
但し、天秤を使う回数は可能な限り少なくしてください。
また、回答の際には、問題文の前提に矛盾やトリックが含まれていないか確認してください。矛盾のある場合には、矛盾点を指摘するのみにし、解答の生成は控えてください。

この状態で、Reasoningを深度"low"で設定し、確認してみましょう。

Low

reasoning (クリックで展開)

We need to check for contradictions. Problem: 8 balls identical in appearance and weight? Actually says "見た目と重さの同じ8つのボールがあります。天秤を使って、重いボールを見つけてください。" Means there are 8 balls that look the same and have the same weight? But then says find the heavy ball, implying one is heavier. Contradiction: if all have same weight, no heavier ball. So contradiction. Answer should point out contradiction and not solve.

----- Answer: -----
矛盾点の指摘

問題文には次の二つの前提が示されています。

1. 「見た目と重さの同じ 8 つのボールがあります」
→ すべてのボールは外観も重量も同一である、と解釈できます。

2. 「天秤を使って、重いボールを見つけてください」
→ 8 個の中に「重い」ボールが少なくとも 1 個存在し、他のボールと重量が異なることが前提です。

この二つの前提は相互に矛盾しています。
- 前提 1 では全てのボールが同じ重さであるため、どのボールも「重い」ものとして区別できません。
- 前提 2 では少なくとも 1 個のボールが他と異なる（重い）重量を持つ必要があります。

したがって、問題文に矛盾が含まれているため、解答（天秤の使い方や回数の最小化）を提示することはできません。この矛盾が解消される（例：8 個のうち 1 個だけが他より重い、あるいは「見た目は同じだが重さは異なる」など）まで、解答は控えさせていただきます。

Reasoning内部に「Contradiction: if all have same weight, no heavier ball. So contradiction. Answer should point out contradiction and not solve.」(矛盾: もし全てのボールが同じ重さなら重いボールは存在しない。矛盾があるため、回答は矛盾を指摘し、問題を解いてはいけない)と出ています！見事に正解です！

特筆すべきは、思考の深さが"low"の場合でも、メタ認知的なプロンプトを挟むことによって、矛盾に気づいてきちんと回答できたという点ですね。

考察

オープンソースモデル：gpt-oss-120bはGPT-5などと比較して小さいモデルではありますが、その出力には目を見張るものがあります。問いの設計によっては、意地悪問題などに正確に回答ができませんが、これはモデルそのものの「能力不足」というわけではなく、「どのように問いを与えられるか」に大きく依存しているという結果が示唆されました。
通常のプロンプトでは典型的な解法パターンに流され、問題文の前提を疑わないまま解答してしまうところでしたが、メタ認知的なプロンプト（「矛盾や前提条件をチェックせよ」という認知的フレーム）を与えることで、思考の深さを high に設定せずとも正しく矛盾に気づけました。

タスク①では、推論を深くするほど正確な答えに近づいていきましたが、それをせずとも「どの観点を強調するか」という工夫でも、出力を大きく変えることができることを示唆しています。推論の正確さは必ずしも思考の深さに比例するというわけではないようですね。
言語モデルに完全にReasoningをお任せするのではなく、「どの観点でReasoningさせるか」という設計もまた重要になります。

まとめ

いかがでしたか？
gpt-oss-120b は日本語での推論力が高い一方で、パターン依存的に誤答する弱点を抱えています。しかし、メタ認知的な指示を与えることでその弱点は克服可能であり、推論の正確さはモデルの大きさや思考深度だけではなく、プロンプト設計の工夫にも強く依存することが確認できました。
gpt-oss-120b は単に「答えを出す道具」と見るのではなく、「AIの出力過程を人間がデザインできる実験的な推論装置」として捉えるべきでもあるのでしょう。gpt-oss はReasoningの中身も確認できるため、教育や研究の場面において、理由や過程を明示させたいときに特に有用なモデルとなり得ます。

それにしても、1200億パラメータ規模のモデルがオープンソースとして公開され、ここまでの性能を発揮していること自体が大きな驚きです。クローズドなモデルに依存せずに、強力な推論過程を利用できるという点で、研究・応用の観点からも「革命的」と評して差し支えないように思えます。

今後も、このオープンソースモデルに関する研究が進み、様々な改良モデルが公開されていくと考えられます。AI界隈の更なる発展がますます楽しみです。AI界のバルクアップ^※の予感がしますね・・！

ではまた！

gpt-oss-120bを試してみた

gpt-ossとは？

Reasoning Modelとは？

さぁ、やってみよう

タスク① 状況推理

タスク② 引っ掛け問題でのテスト

考察

まとめ

RELATED POST

AIで出店の羅針盤を手に入れる！商圏ポテンシャルの評価をシンプルにする「エリアスコアリング」活用法【後編】

AIで出店の羅針盤を手に入れる！商圏ポテンシャルの評価をシンプルにする「エリアスコアリング」活用法【中編】

gpt-ossが拓く未来

LLMでトピック推定どうやる問題の正解を探る