How Far Behind ChatGPT Are the Alpaca-Family Large Models? After a Detailed Evaluation, I Fell Silent

2023-05-15 10:22:49 Source: 機器之心 (Synced)

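A while ago, a leaked Google document drew wide attention. In it, a researcher inside Google made a striking claim: Google has no moat, and neither does OpenAI.

The researcher argued that although OpenAI and Google appear to be chasing each other on large AI models, the real winner may not be either of them, because a third force is quietly rising.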

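That force is open source. Around open models such as Meta's LLaMA, the community is rapidly building models with capabilities similar to those of OpenAI's and Google's large models, and open-source models iterate faster, are more customizable, and offer more privacy. "People will not pay for a restricted model when free, unrestricted alternatives are comparable in quality," the author wrote.

These views sparked considerable debate on social media. One of the bigger questions: can these open-source models really reach a level similar to commercial closed-source models such as OpenAI's ChatGPT or Google's Bard? How large is the gap between the two camps today?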

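To explore this question, a Medium blogger named Marco Tulio Ribeiro tested several models (Vicuna-13B and MPT-7b-Chat vs. ChatGPT 3.5) on some complex tasks.

Vicuna-13B is an open-source model proposed by researchers from UC Berkeley, Carnegie Mellon University, Stanford University, and UC San Diego. It is built on the 13B-parameter version of LLaMA and performed very well in an evaluation scored by GPT-4 (see "Replicating 90% of ChatGPT's ability for $300, with GPT-4 as the examiner: the 13-billion-parameter open-source model Vicuna is here").

MPT-7B is a large language model released by MosaicML that follows the training regime of Meta's LLaMA models. According to MosaicML, MPT-7B matches the performance of Meta's 7-billion-parameter LLaMA model.

The baseline they are compared against is, naturally, ChatGPT, the reference point for large language models.

Marco Tulio Ribeiro is a researcher in the Adaptive Systems and Interaction group at Microsoft Research and an affiliate assistant professor at the University of Washington. He carried out this work together with Scott Lundberg, another Microsoft researcher. In the tests, they used Microsoft's guidance library to help design the prompts.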

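Warm-up: solving equations

The first task is solving simple polynomial equations. These problems have definite answers, so correctness is easy to judge.

For all three models, the testers asked for the roots of the quadratic equation "x^2+3x=0". They used the following prompt:

The three models performed as follows.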

ChatGPT:

equation = "x^2 + 3.0x = 0"
roots = [0, -3]
answer_gpt = find_roots(llm=chatgpt, equation=equation)

Vicuna:

answer_vicuna = find_roots(llm=vicuna, equation=equation)

MPT:

answer_mpt = find_roots(llm=mpt, equation=equation)

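Clearly, the correct answer is [-3, 0], and only ChatGPT got it right (Vicuna did not even answer in the specified format).

In the notebook accompanying the post, the testers wrote a function that generates random quadratic equations with integer roots between -20 and 20, and ran the prompt 20 times on each model. The accuracy of the three models was as follows: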

╔═══════════╦══════════╗
║   Model   ║ Accuracy ║
╠═══════════╬══════════╣
║ ChatGPT   ║   80%    ║
║ Vicuna    ║    0%    ║
║ MPT       ║    0%    ║
╚═══════════╩══════════╝
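The generator and scoring described above can be sketched roughly as follows. This is a minimal illustration, not the code from the notebook: the names `random_quadratic`, `is_correct`, and `accuracy` are hypothetical, and `solver` stands in for whatever call sends the equation to a model and parses its reply.

```python
import random


def random_quadratic(rng, low=-20, high=20):
    """Return (equation_text, roots) for x^2 + bx + c = 0 with integer roots."""
    r1 = rng.randint(low, high)
    r2 = rng.randint(low, high)
    # (x - r1)(x - r2) = x^2 - (r1 + r2)x + r1*r2
    b = -(r1 + r2)
    c = r1 * r2
    equation = f"x^2 {b:+d}x {c:+d} = 0"
    return equation, sorted([r1, r2])


def is_correct(answer, roots):
    """Compare as sorted lists so that [0, -3] and [-3, 0] count as equal."""
    return sorted(answer) == sorted(roots)


def accuracy(solver, n_trials=20, seed=0):
    """Fraction of random equations for which solver(equation) is correct.

    `solver` is a hypothetical callable: equation text in, list of roots out.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        equation, roots = random_quadratic(rng)
        hits += is_correct(solver(equation), roots)
    return hits / n_trials
```

Comparing sorted lists is one way to make sure an answer such as [0, -3] is not marked wrong merely because the reference lists the roots as [-3, 0].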

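On the quadratic equations, ChatGPT got some problems wrong, but Vicuna and MPT did not solve a single one and often made mistakes in the intermediate steps (MPT often did not even write out intermediate steps). Below is an example of a ChatGPT error: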

ChatGPT made an arithmetic mistake in the last step: (13 +- 25)/2 should give [19, -6], not [19.5, -6.5].
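The arithmetic in that last step is easy to check mechanically. The sketch below applies the quadratic formula; the coefficients passed in are an assumption, chosen only because they reproduce the quoted step (-b = 13, square root of the discriminant = 25), since the actual equation is not shown in this article.

```python
import math


def quadratic_roots(a, b, c):
    """Real roots of a*x^2 + b*x + c = 0 via the quadratic formula."""
    disc = b * b - 4 * a * c
    if disc < 0:
        raise ValueError("no real roots")
    sqrt_disc = math.sqrt(disc)
    return [(-b + sqrt_disc) / (2 * a), (-b - sqrt_disc) / (2 * a)]


# Hypothetical equation x^2 - 13x - 114 = 0, which yields (13 +- 25) / 2.
print(quadratic_roots(1, -13, -114))  # [19.0, -6.0]
```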

Since Vicuna and MPT could not solve quadratic equations at all, the testers gave them some easier problems, such as x-10=0. On these simple equations, they obtained the following results:

╔═══════════╦══════════╗
║   Model   ║ Accuracy ║
╠═══════════╬══════════╣
║ ChatGPT   ║   100%   ║
║ Vicuna    ║    85%   ║
║ MPT       ║    30%   ║
╚═══════════╩══════════╝

Below is an example of an MPT error:

Conclusion

In this very simple test, using the same problems and the same prompt, the testers concluded that ChatGPT far exceeds Vicuna and MPT in accuracy.

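Task: extracting snippets + answering questions about a meeting

This task is more realistic. For meeting-related question answering, people may also prefer open-source models for security and privacy reasons, rather than sending private data to OpenAI.

Below is a meeting transcript (the translation is from DeepL and is for reference only):

The testers' first question was "What does Steven think about the acquisition?", with the following prompt: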

qa_attempt1 = guidance("""{{#system~}}
{{llm.default_system_prompt}}
{{~/system}}

{{#user~}}
You will read a meeting transcript, then extract the relevant segments to answer the following question:
Question: {{query}}
Here is a meeting transcript:
----
{{transcript}}
----
Please answer the following question:
Question: {{query}}
Extract from the transcript the most relevant segments for the answer, and then answer the question.
{{/user}}

{{#assistant~}}
{{gen "answer"}}
{{~/assistant~}}""")

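ChatGPT gave the following answer:

Although the answer is plausible, ChatGPT did not extract any conversation segments to support it (and therefore did not meet the testers' specification). The testers iterated through 5 different prompts in the notebook; here are some examples: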

qa_attempt3 = guidance("""{{#system~}}
{{llm.default_system_prompt}}
{{~/system}}

{{#user~}}
You will read a meeting transcript, then extract the relevant segments to answer the following question:
Question: {{query}}
Here is a meeting transcript:
----
{{transcript}}
----
Based on the above, please answer the following question:
Question: {{query}}
Please extract from the transcript whichever conversation segments are most relevant for the answer, and then answer the question.
Note that conversation segments can be of any length, e.g. including multiple conversation turns.
Please extract at most 3 segments. If you need less than three segments, you can leave the rest blank.

As an example of output format, here is a fictitious answer to a question about another meeting transcript.
CONVERSATION SEGMENTS:
Segment 1: Peter and John discuss the weather.
Peter: John, how is the weather today?
John: It's raining.
Segment 2: Peter insults John
Peter: John, you are a bad person.
Segment 3: Blank

ANSWER: Peter and John discussed the weather and Peter insulted John.
{{/user}}

{{#assistant~}}
{{gen "answer"}}
{{~/assistant~}}""")

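With this new prompt, ChatGPT did extract relevant segments, but it did not follow the output format the testers specified (it did not summarize each segment or give the speakers' names).

After they built a more complex prompt, however, ChatGPT finally followed the instructions: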

qa_attempt5 = guidance("""{{#system~}}
{{llm.default_system_prompt}}
{{~/system}}

{{#user~}}
You will read a meeting transcript, then extract the relevant segments to answer the following question:
Question: What were the main things that happened in the meeting?
Here is a meeting transcript:
----
Peter: Hey
John: Hey
Peter: John, how is the weather today?
John: It's raining.
Peter: That's too bad. I was hoping to go for a walk later.
John: Yeah, it's a shame.
Peter: John, you are a bad person.
----
Based on the above, please answer the following question:
Question: {{query}}
Please extract from the transcript whichever conversation segments are most relevant for the answer, and then answer the question.
Note that conversation segments can be of any length, e.g. including multiple conversation turns.
Please extract at most 3 segments. If you need less than three segments, you can leave the rest blank.
{{/user}}

{{#assistant~}}
CONVERSATION SEGMENTS:
Segment 1: Peter and John discuss the weather.
Peter: John, how is the weather today?
John: It's raining.
Segment 2: Peter insults John
Peter: John, you are a bad person.
Segment 3: Blank

ANSWER: Peter and John discussed the weather and Peter insulted John.
{{~/assistant~}}

{{#user~}}
You will read a meeting transcript, then extract the relevant segments to answer the following question:
Question: {{query}}
Here is a meeting transcript:
----
{{transcript}}
----
Based on the above, please answer the following question:
Question: {{query}}
Please extract from the transcript whichever conversation segments are most relevant for the answer, and then answer the question.
Note that conversation segments can be of any length, e.g. including multiple conversation turns.
Please extract at most 3 segments. If you need less than three segments, you can leave the rest blank.
{{~/user}}

{{#assistant~}}
{{gen "answer"}}
{{~/assistant~}}""")

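The testers explained that they had to iterate on the prompt so many times because the OpenAI API does not allow partial output completion (that is, they cannot specify how the AI assistant begins its answer), which makes it hard to steer the output.

With an open-source model, by contrast, they can guide the output more directly and force the model to use the structure they specify.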
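The idea of forcing a structure can be illustrated without any particular library: the program writes the fixed parts of the answer itself and asks the model only for the gaps. The sketch below is a simplified illustration of that idea, not guidance's implementation. `fill_template` and `ANSWER_TEMPLATE` are hypothetical names, and `complete(text_so_far, stop)` is an assumed stand-in for a local model call that continues the text and stops before the `stop` string.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Each entry is (fixed_text, slot_name). The fixed text is appended by the
# program; only the slot that follows it is generated by the model.
Template = List[Tuple[str, Optional[str]]]

ANSWER_TEMPLATE: Template = [
    ("CONVERSATION SEGMENTS:\nSegment 1: ", "segment1"),
    ("\nSegment 2: ", "segment2"),
    ("\nSegment 3: ", "segment3"),
    ("\nANSWER: ", "answer"),
]


def fill_template(
    prompt: str,
    template: Template,
    complete: Callable[[str, Optional[str]], str],
) -> Dict[str, str]:
    """Alternate forced text and model completions, returning the filled slots."""
    text = prompt
    slots: Dict[str, str] = {}
    for i, (fixed, slot) in enumerate(template):
        text += fixed  # forced structure: never generated by the model
        if slot is None:
            continue
        # Stop generation where the next fixed piece would begin, if any.
        stop = template[i + 1][0] if i + 1 < len(template) else None
        generated = complete(text, stop)
        slots[slot] = generated.strip()
        text += generated
    return slots
```

Because the headers "Segment 1:", "Segment 2:", and so on come from the program, the output cannot drift away from the requested format, and the model spends no generation steps producing them.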

The new round of tests used the following prompt:

qa_guided = guidance("""{{#system~}}
{{llm.default_system_prompt}}
{{~/system}}

{{#user~}}
You will read a meeting transcript, then extract the relevant segments to answer the following question:
Question: {{query}}
----
{{transcript}}
----
Based on the above, please answer the following question:
Question: {{query}}
Please extract the three segment from the transcript that are the most relevant for the answer, and then answer the question.
Note that conversation segments can be of any length, e.g. including multiple conversation turns. If you need less than three segments, you can leave the rest blank.

As an example of output format, here is a fictitious answer to a question about another meeting transcript:
CONVERSATION SEGMENTS:
Segment 1: Peter and John discuss the weather.
Peter: John, how is the weather today?
John: It's raining.
Segment 2: Peter insults John
Peter: John, you are a bad person.
Segment 3: Blank

ANSWER: Peter and John discussed the weather and Peter insulted John.
{{/user}}

{{#assistant~}}
CONVERSATION SEGMENTS:
Segment 1: {{gen "segment1"}}
Segment 2: {{gen "segment2"}}
Segment 3: {{gen "segment3"}}

ANSWER: {{gen "answer"}}
{{~/assistant~}}""")

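Running the above prompt with Vicuna produced the correct format on the first try, and the format always stayed correct:

The same prompt can of course also be run on MPT:

Although MPT followed the format, it did not answer the question based on the given meeting transcript; instead, it extracted segments from the format example. That is clearly unacceptable.

Next, ChatGPT and Vicuna were compared.

The testers asked "Who wants to sell the company?" Both models appeared to answer well.

Here is ChatGPT's answer:

Here is Vicuna's answer:

Next, the testers switched to a different text, a conversation between Elon Musk and a reporter:

The testers' question was: "Did Elon Musk insult the reporter?"

ChatGPT's answer was:

Vicuna's answer was:

Vicuna produced the correct format, and even the extracted segments were right. Surprisingly, though, it still gave the wrong final answer: "Elon musk does not accuse him of lying or insult him in any way".

The testers also ran other question-answering tests and concluded that Vicuna is comparable to ChatGPT on most questions, but gets answers wrong more often than ChatGPT does.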

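Completing tasks with bash

The testers had several LLMs iteratively use a bash shell to solve problems. Each time a model issued a command, the testers ran it and inserted the output into the prompt, repeating this process until the task was done.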
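The loop just described can be sketched in a few lines. This is an illustrative outline under stated assumptions rather than the testers' program: `command_loop` and `run_shell` are hypothetical names, `next_message(history)` stands in for a model call that sees the commands and outputs so far, and the `COMMAND:` / `DONE` protocol is taken from the prompts in this article.

```python
import subprocess
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]  # (command, output) pairs seen so far


def run_shell(command: str) -> str:
    """Run one command in a shell and return its combined stdout and stderr."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=30
    )
    return (result.stdout + result.stderr).strip()


def command_loop(
    next_message: Callable[[History], str],
    run: Callable[[str], str] = run_shell,
    max_steps: int = 10,
) -> History:
    """Ask for a command, run it, feed the output back, until the model stops."""
    history: History = []
    for _ in range(max_steps):
        message = next_message(history).strip()
        if not message.startswith("COMMAND:"):
            break  # free text (for example a final summary ending in DONE)
        command = message[len("COMMAND:"):].strip()
        if command == "DONE":
            break
        history.append((command, run(command)))
    return history
```

Executing model-issued commands is risky, so a sketch like this belongs in a sandbox or a throwaway directory; passing a custom `run` function also makes it possible to log or filter commands before they execute.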

The prompt for ChatGPT was as follows:

terminal = guidance("""{{#system~}}
{{llm.default_system_prompt}}
{{~/system}}

{{#user~}}
Please complete the following task:
Task: list the files in the current directory
You can give me one bash command to run at a time, using the syntax:
COMMAND: command
I will run the commands on my terminal, and paste the output back to you. Once you are done with the task, please type DONE.
{{/user}}

{{#assistant~}}
COMMAND: ls
{{~/assistant~}}

{{#user~}}
Output: guidance project
{{/user}}

{{#assistant~}}
The files or folders in the current directory are:
- guidance
- project
DONE
{{~/assistant~}}

{{#user~}}
Please complete the following task:
Task: {{task}}
You can give me one bash command to run at a time, using the syntax:
COMMAND: command
I will run the commands on my terminal, and paste the output back to you. Once you are done with the task, please type DONE.
{{/user}}

{{#geneach "commands" stop=False}}
{{#assistant~}}
{{gen "this.command"}}
{{~/assistant~}}

{{~#user~}}
Output: {{shell this.command)}}
{{~/user~}}
{{/geneach}}""")

The testers created a dummy repository in ~/work/project containing a file named license.txt, which is not the standard LICENSE filename.

Then, without giving ChatGPT any further help, they checked whether it could complete the task "Find out what license the open source project located in ~/work/project is using".

ChatGPT followed a very natural sequence of steps and solved the problem.

For the open-source models, the testers wrote a simpler (guided) prompt that contains a sequence of commands and outputs:

guided_terminal = guidance("""{{#system~}}
{{llm.default_system_prompt}}
{{~/system}}

{{#user~}}
Please complete the following task:
Task: list the files in the current directory
You can run bash commands using the syntax:
COMMAND: command
OUTPUT: output
Once you are done with the task, use the COMMAND: DONE.
{{/user}}

{{#assistant~}}
COMMAND: ls
OUTPUT: guidance project
COMMAND: DONE
{{~/assistant~}}

{{#user~}}
Please complete the following task:
Task: {{task}}
You can run bash commands using the syntax:
COMMAND: command
OUTPUT: output
Once you are done with the task, use the COMMAND: DONE.
{{~/user}}

{{#assistant~}}
{{#geneach "commands" stop=False ~}}
COMMAND: {{gen "this.command" stop="\\n"}}
OUTPUT: {{shell this.command)}}{{~/geneach}}
{{~/assistant~}}""")

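Let's see how Vicuna and MPT did on this task.

Vicuna:

MPT:

In an interesting twist, Vicuna failed to solve the task but MPT succeeded. Besides privacy, open-source models have a notable advantage here: the whole prompt is handled as a single LLM run (the testers even sped it up by not having the model generate output-structure tokens such as COMMAND).

By contrast, they had to call ChatGPT again for every command, which is slower and more expensive.

Next, they tried a different task: "Find all jupyter notebook files in ~/work/guidance that are currently untracked by git".

Here is ChatGPT's answer: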

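The testers again ran into a problem: ChatGPT did not follow the output structure they specified (which makes it unusable inside a program without human intervention). The program only executes commands, so it stopped after the last ChatGPT message above.

The testers suspected that the empty output caused ChatGPT to stop, so they worked around this particular problem by changing the message sent when a command produces no output. However, they could not solve the general problem of being unable to force ChatGPT to follow the specified output structure.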
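Both halves of this workaround are simple to sketch: a tolerant parser that looks for a `COMMAND:` line anywhere in a free-form reply, and an output formatter that never sends an empty message back. The function names and the placeholder wording below are assumptions for illustration; the article does not quote the message the testers actually used.

```python
import re
from typing import Optional

# Matches a line of the form "COMMAND: <something>" anywhere in the reply.
COMMAND_RE = re.compile(r"^COMMAND:\s*(.+)$", re.MULTILINE)


def parse_command(message: str) -> Optional[str]:
    """Return the first `COMMAND: ...` payload in a reply, or None if absent."""
    match = COMMAND_RE.search(message)
    return match.group(1).strip() if match else None


def format_output(output: str) -> str:
    """Build the `Output:` message, substituting a placeholder when empty.

    The placeholder text is hypothetical, not the testers' wording.
    """
    body = output.strip() or "(the command produced no output)"
    return f"Output: {body}"
```

A parser like this can recover a command wrapped in extra chatter, but it cannot help when the reply contains no `COMMAND:` line at all, which is the general problem described above.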

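After this small change, ChatGPT was able to solve the problem:

Let's see how Vicuna did:

Vicuna followed the output structure, but unfortunately it ran the wrong commands for the task. MPT called git status over and over, so it failed too.

The testers also ran these programs on various other instructions and found that ChatGPT almost always produces the correct sequence of commands but sometimes does not follow the specified format (and so needs human intervention). The open-source models did not do well here (more prompt engineering might improve them, but they failed on most of the harder instructions).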

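Summary

The testers also tried several other tasks, including text summarization, question answering, creative generation, and toy string manipulation, and assessed the accuracy of the models. The main findings are as follows: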

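Task quality: On every task, ChatGPT (3.5) was stronger than Vicuna, while MPT performed poorly on almost all tasks, to the point that the testers suspected they were using it incorrectly. Notably, Vicuna's performance was usually close to ChatGPT's.

Ease of use: ChatGPT has trouble following a specified output format, which makes it hard to use inside programs and requires writing regular-expression parsers for its output. By contrast, being able to specify the output structure is a significant advantage of open-source models, to the extent that Vicuna is sometimes easier to use than ChatGPT even though its task performance is somewhat worse.

Efficiency: Deploying a model locally means a task can be solved in a single LLM run (guidance keeps the LLM state while the program executes), which is faster and cheaper. This is especially true when any sub-step involves calling other APIs or functions (for example search or a terminal), which always requires a new call to the OpenAI API. guidance also speeds up generation by not having the model generate the output-structure tokens, which sometimes makes a big difference.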

Overall, the test concluded that MPT is not yet ready for real-world use, while Vicuna is a viable alternative to ChatGPT (3.5) for many tasks. These findings currently apply only to the tasks and inputs (or prompt types) tried in this test, which is an initial exploration rather than a formal evaluation.

For more results, see the notebook: https://github.com/microsoft/guidance/blob/main/notebooks/chatgpt_vs_open_source_on_harder_tasks.ipynb
