With the proliferation of online social media and review platforms, a plethora of opinionated data has been logged, bearing great potential for supporting decision-making processes. Sentiment analysis studies people's sentiments in their produced text, such as product reviews, blog comments, and forum discussions. It enjoys wide applications in fields as diverse as politics (e.g., analysis of public sentiment towards policies), finance (e.g., analysis of market sentiment), and marketing (e.g., product research and brand management).
Since sentiments can be categorized as discrete polarities or scales (e.g., positive and negative), we can consider sentiment analysis as a text classification task, which transforms a varying-length text sequence into a fixed-length text category. In this chapter, we will use Stanford's large movie review dataset for sentiment analysis. It consists of a training set and a test set, each containing 25000 movie reviews downloaded from IMDb. In both datasets, there are equal numbers of "positive" and "negative" labels, indicating different sentiment polarities.
import os
import torch
from torch import nn
from d2l import torch as d2l
16.1.1. Reading the Dataset
First, download and extract this IMDb review dataset in the path ../data/aclImdb.
#@save
d2l.DATA_HUB['aclImdb'] = (d2l.DATA_URL + 'aclImdb_v1.tar.gz',
                           '01ada507287d82875905620988597833ad4e0903')

data_dir = d2l.download_extract('aclImdb', 'aclImdb')
Downloading ../data/aclImdb_v1.tar.gz from http://d2l-data.s3-accelerate.amazonaws.com/aclImdb_v1.tar.gz...
Next, read the training and test datasets. Each example is a review and its label: 1 for "positive" and 0 for "negative".
#@save
def read_imdb(data_dir, is_train):
    """Read the IMDb review dataset text sequences and labels."""
    data, labels = [], []
    for label in ('pos', 'neg'):
        folder_name = os.path.join(data_dir, 'train' if is_train else 'test',
                                   label)
        for file in os.listdir(folder_name):
            with open(os.path.join(folder_name, file), 'rb') as f:
                review = f.read().decode('utf-8').replace('\n', '')
                data.append(review)
                labels.append(1 if label == 'pos' else 0)
    return data, labels

train_data = read_imdb(data_dir, is_train=True)
print('# trainings:', len(train_data[0]))
for x, y in zip(train_data[0][:3], train_data[1][:3]):
    print('label:', y, 'review:', x[:60])
# trainings: 25000
label: 1 review: Henry Hathaway was daring, as well as enthusiastic, for his
label: 1 review: An unassuming, subtle and lean film, "The Man in the White S
label: 1 review: Eddie Murphy really made me laugh my ass off on this HBO sta
16.1.2. Preprocessing the Dataset
Treating each word as a token and filtering out words that appear less than 5 times, we create a vocabulary out of the training dataset.
train_tokens = d2l.tokenize(train_data[0], token='word')
vocab = d2l.Vocab(train_tokens, min_freq=5, reserved_tokens=['<pad>'])
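d2l.Vocab is used here as a black box. Conceptually, it counts token frequencies across the corpus and assigns an index to every token that appears at least min_freq times, reserving low indices for special tokens. A minimal sketch of that idea (build_vocab is a hypothetical helper for illustration, not the d2l API):

```python
from collections import Counter

def build_vocab(token_lines, min_freq=5, reserved_tokens=('<pad>',)):
    """Map each sufficiently frequent token to an integer index."""
    # Count every token across all tokenized lines
    counter = Counter(tok for line in token_lines for tok in line)
    # Index 0 is <unk>; reserved tokens come next, then frequent tokens
    idx_to_token = ['<unk>', *reserved_tokens] + [
        tok for tok, freq in counter.most_common() if freq >= min_freq]
    token_to_idx = {tok: idx for idx, tok in enumerate(idx_to_token)}
    return token_to_idx, idx_to_token

token_to_idx, idx_to_token = build_vocab(
    [['good', 'movie'], ['good', 'plot']], min_freq=2)
print(token_to_idx['good'])  # 2 -- only 'good' meets min_freq here
```

Rare tokens are simply absent from the mapping and would be looked up as index 0 (`<unk>`) at encoding time.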
After tokenization, let us plot the histogram of review lengths in tokens.
d2l.set_figsize()
d2l.plt.xlabel('# tokens per review')
d2l.plt.ylabel('count')
d2l.plt.hist([len(line) for line in train_tokens], bins=range(0, 1000, 50));
As we expected, the reviews have varying lengths. To process a minibatch of such reviews at each time, we set the length of each review to 500 with truncation and padding, which is similar to the preprocessing step for the machine translation dataset in Section 10.5.
num_steps = 500  # sequence length
train_features = torch.tensor([d2l.truncate_pad(
    vocab[line], num_steps, vocab['<pad>']) for line in train_tokens])
print(train_features.shape)
torch.Size([25000, 500])
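The logic behind d2l.truncate_pad is simple: cut a sequence down to num_steps entries if it is too long, otherwise append the padding token until it reaches num_steps. A sketch of this logic (mirroring, but not necessarily identical to, the d2l helper):

```python
def truncate_pad(line, num_steps, padding_token):
    """Truncate or pad a token-index list to exactly num_steps entries."""
    if len(line) > num_steps:
        return line[:num_steps]  # truncate long sequences
    # Pad short sequences with the padding token
    return line + [padding_token] * (num_steps - len(line))

print(truncate_pad([7, 8, 9], 5, 0))       # [7, 8, 9, 0, 0]
print(truncate_pad(list(range(8)), 5, 0))  # [0, 1, 2, 3, 4]
```

Either way, every review maps to a row of exactly num_steps indices, which is what lets the 25000 reviews stack into a single (25000, 500) tensor.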
16.1.3. Creating Data Iterators
現(xiàn)在我們可以創(chuàng)建數(shù)據(jù)迭代器。在每次迭代中,返回一小批示例。
train_iter = d2l.load_array((train_features, torch.tensor(train_data[1])), 64)

for X, y in train_iter:
    print('X:', X.shape, ', y:', y.shape)
    break
print('# batches:', len(train_iter))
X: torch.Size([64, 500]) , y: torch.Size([64])
# batches: 391
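d2l.load_array is a thin convenience wrapper. In PyTorch terms it amounts roughly to wrapping the in-memory tensors in a TensorDataset and handing that to a DataLoader (a sketch of the idea, not the exact d2l implementation):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def load_array(data_arrays, batch_size, is_train=True):
    """Construct a PyTorch data iterator over in-memory tensors."""
    dataset = TensorDataset(*data_arrays)
    # Shuffle only during training so evaluation order stays deterministic
    return DataLoader(dataset, batch_size, shuffle=is_train)

features = torch.zeros(100, 500, dtype=torch.long)
labels = torch.zeros(100, dtype=torch.long)
data_iter = load_array((features, labels), 64)
X, y = next(iter(data_iter))
print(X.shape, y.shape)  # torch.Size([64, 500]) torch.Size([64])
```

With 25000 examples and a batch size of 64, the iterator yields ceil(25000 / 64) = 391 batches, matching the output above.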
16.1.4. Putting It All Together
Last, we wrap up the above steps into the load_data_imdb function. It returns training and test data iterators and the vocabulary of the IMDb review dataset.
#@save
def load_data_imdb(batch_size, num_steps=500):
    """Return data iterators and the vocabulary of the IMDb review dataset."""
    data_dir = d2l.download_extract('aclImdb', 'aclImdb')
    train_data = read_imdb(data_dir, True)
    test_data = read_imdb(data_dir, False)
    train_tokens = d2l.tokenize(train_data[0], token='word')
    test_tokens = d2l.tokenize(test_data[0], token='word')
    vocab = d2l.Vocab(train_tokens, min_freq=5)
    train_features = torch.tensor([d2l.truncate_pad(
        vocab[line], num_steps, vocab['<pad>']) for line in train_tokens])
    test_features = torch.tensor([d2l.truncate_pad(
        vocab[line], num_steps, vocab['<pad>']) for line in test_tokens])
    train_iter = d2l.load_array((train_features, torch.tensor(train_data[1])),
                                batch_size)
    test_iter = d2l.load_array((test_features, torch.tensor(test_data[1])),
                               batch_size, is_train=False)
    return train_iter, test_iter, vocab
16.1.5. Summary
Sentiment analysis studies people's sentiments in their produced text, which is considered as a text classification problem that transforms a varying-length text sequence into a fixed-length text category.
After preprocessing, we can load Stanford's large movie review dataset (IMDb review dataset) into data iterators with a vocabulary.
16.1.6. Exercises
1. What hyperparameters in this section can we modify to accelerate training sentiment analysis models?
2. Can you implement a function to load the dataset of Amazon reviews into data iterators and labels for sentiment analysis?