i have dataset of ratings of photographs, generated many individual raters.
each rater given several images rate, , each image, rater provides several different ratings, plus description.
so, example, rater might asked rate 3 different photos, , provide separate ratings on 0-5 scale how happy, sad, , interesting each photo is. in addition, rater asked provide short text description of each photo.
the same photograph may rated multiple raters, not photographs have been rated same number of raters.
currently data in form (each url represents photo):
rater_id | url1 | url2 | url3 | rating_1(1) | rating_2(1) | rating_1(2) | rating_2(2) | rating_1(3) | rating_2(3) | description(1) | description(2) | description(3) 001 | | b | c | 3.0 | 2.5 | 4.0 | 1.5 | 5.0 | 5.0 | sunny day | rainy day | foggy day 002 | | b | d | 1.0 | 4.5 | 3.0 | 3.5 | 1.0 | 3.5 | sunshine | rain | snow
i have few transformations i'm trying achieve here.
first want change data frame indexed photo url - of url fields (url1, url2, etc) melted 1 long column url
. meaning:
url1 | url2 | url3 -----|------|----- | b | c
becomes
url --- b c
within each url
index, there nested index rater_id
, has row of rater's ratings , description given photo.
eg.:
| rating_1 | rating_2 | description url | rater_id | 001 | 3.0 | 2.5 | sunny day | 002 | 1.0 | 4.5 | sunshine ----|-----------|----------|----------|------------ b | 001 | 4.0 | 1.5 | rainy day | 002 | 4.5 | 3.0 | rain ----|-----------|----------|----------|------------ c | 001 | 5.0 | 5.0 | foggy day ----|-----------|----------|----------|------------ d | 002 | 1.0 | 3.5 | snow
finally, want aggregate ratings , descriptions each photo url: - mean , variance of numeric ratings - tab-separated string of descriptions - number of raters have rated each photo
eg.:
url | rating_1_avg | rating_1_var | rating_2_avg | rating_2_var | all_descriptions | total_ratings | 2.0 | 2.0 | 3.0 | 2.0 | sunny day sunshine | 2 ----|--------------|--------------|--------------|--------------|-----------------------|-------------- b | 4.25 | 0.125 | 2.25 | 1.125 | rainy day rain | 2 ----|--------------|--------------|--------------|--------------|-----------------------|-------------- c | 5.0 | na | 5.0 | na | foggy day | 1 ----|--------------|--------------|--------------|--------------|-----------------------|-------------- d | 1.0 | na | 3.5 | na | snow | 1
i've tried number of approaches pandas reshaping tools, including melt
, wide_to_long
, can't figure out how first photo urls long format, , created nested index arrange data i've presented above. i'm pretty comfortable pandas groupby
, basic aggregation, bit beyond skill level. appreciated!
note: fields i've given in these dummy data not exact names in actual dataset, follow same naming conventions exactly. photo urls url1
, url2
, etc, , ratings fields denoted rating_<rating_category_number>(<url_number>)
, eg. rating_1(2)
. description fields denoted description(<url_number>)
, eg. description(2)
.
here's python code build initial dataset:
df = pd.dataframe({'id': {0: '001', 1: '002'}, 'url1': {0: 'a', 1: 'a'}, 'url2': {0: 'b', 1: 'b'}, 'url3': {0: 'c', 1: 'd'}}) df['rating_1(1)'] = [3.0, 1] df['rating_2(1)'] = [2.5, 4.5] df['rating_1(2)'] = [4.0, 3] df['rating_2(2)'] = [1.5, 3.5] df['rating_1(3)'] = [5.0, 1] df['rating_2(3)'] = [5.0, 3.5] df['description(1)'] = ['sunny day','sunshine'] df['description(2)'] = ['rainy day','rain'] df['description(3)'] = ['foggy day','snow']
you can first find columns each category str.contains
, use not known pd.lreshape
. last aggregate columns mean
, var
, count
, join
:
#select columns each category rat1 = df.columns[df.columns.str.contains(r'rating_1')].tolist() print rat1 ['rating_1(1)', 'rating_1(2)', 'rating_1(3)'] rat2 = df.columns[df.columns.str.contains(r'rating_2')].tolist() url = df.columns[df.columns.str.contains(r'url')].tolist() desc = df.columns[df.columns.str.contains(r'description')].tolist() df = pd.lreshape(df, {'rat1': rat1, 'rat2': rat2,'url': url,'desc': desc}) print df rater_id url rat2 rat1 desc 0 '001' 2.5 3.0 sunny day 1 '002' 4.5 1.0 sunshine 2 '001' b 1.5 4.0 rainy day 3 '002' b 3.5 3.0 rain 4 '001' c 5.0 5.0 foggy day 5 '002' d 3.5 1.0 snow #aggregate df = df.groupby(['url']).agg({'rat1':['mean', 'var'], 'rat2':['mean', 'var'], 'desc': ' '.join, 'rater_id': 'count'}) #reset multiindex in columns df.columns = ['_'.join(col) col in df.columns.values]
print df rater_id_count rat2_mean rat2_var rat1_mean rat1_var \ url 2 3.5 2.0 2.0 2.0 b 2 2.5 2.0 3.5 0.5 c 1 5.0 nan 5.0 nan d 1 3.5 nan 1.0 nan desc_join url sunny day sunshine b rainy day rain c foggy day d snow
Comments
Post a Comment