python - Pandas: Reshape wide dataframe to multi-indexed long -


i have dataset of ratings of photographs, generated many individual raters.
each rater given several images rate, , each image, rater provides several different ratings, plus description.

so, example, rater might asked rate 3 different photos, , provide separate ratings on 0-5 scale how happy, sad, , interesting each photo is. in addition, rater asked provide short text description of each photo.

the same photograph may rated multiple raters, not photographs have been rated same number of raters.

currently data in form (each url represents photo):

rater_id | url1 | url2 | url3 | rating_1(1) | rating_2(1) | rating_1(2) | rating_2(2)   | rating_1(3) | rating_2(3) | description(1) | description(2) | description(3)      001 |    |   b  |   c  |     3.0     |     2.5     |     4.0     |     1.5       |     5.0     |     5.0     |  sunny day     |  rainy day     |  foggy day      002 |    |   b  |   d  |     1.0     |     4.5     |     3.0     |     3.5       |     1.0     |     3.5     |  sunshine      |  rain          |  snow 

i have few transformations i'm trying achieve here.
first want change data frame indexed photo url - of url fields (url1, url2, etc) melted 1 long column url. meaning:

url1 | url2 | url3 -----|------|-----    |   b  |  c 

becomes

url ---   b  c 

within each url index, there nested index rater_id, has row of rater's ratings , description given photo.
eg.:

                | rating_1 | rating_2 | description url | rater_id   |      001  |    3.0   |   2.5    |  sunny day     |      002  |    1.0   |   4.5    |  sunshine ----|-----------|----------|----------|------------  b  |      001  |    4.0   |   1.5    |  rainy day     |      002  |    4.5   |   3.0    |  rain ----|-----------|----------|----------|------------  c  |      001  |    5.0   |   5.0    |  foggy day ----|-----------|----------|----------|------------  d  |      002  |    1.0   |   3.5    |  snow 

finally, want aggregate ratings , descriptions each photo url: - mean , variance of numeric ratings - tab-separated string of descriptions - number of raters have rated each photo

eg.:

url | rating_1_avg | rating_1_var | rating_2_avg | rating_2_var | all_descriptions      | total_ratings    |     2.0      |     2.0      |     3.0      |     2.0      | sunny day    sunshine |      2 ----|--------------|--------------|--------------|--------------|-----------------------|--------------  b  |     4.25     |     0.125    |     2.25     |     1.125    | rainy day    rain     |      2 ----|--------------|--------------|--------------|--------------|-----------------------|--------------  c  |     5.0      |     na       |     5.0      |     na       | foggy day             |      1 ----|--------------|--------------|--------------|--------------|-----------------------|--------------  d  |     1.0      |     na       |     3.5      |     na       | snow                  |      1 

i've tried number of approaches pandas reshaping tools, including melt , wide_to_long, can't figure out how first photo urls long format, , created nested index arrange data i've presented above. i'm pretty comfortable pandas groupby , basic aggregation, bit beyond skill level. appreciated!

note: fields i've given in these dummy data not exact names in actual dataset, follow same naming conventions exactly. photo urls url1, url2, etc, , ratings fields denoted rating_<rating_category_number>(<url_number>), eg. rating_1(2). description fields denoted description(<url_number>), eg. description(2).

here's python code build initial dataset:

df = pd.dataframe({'id': {0: '001', 1: '002'},                    'url1': {0: 'a', 1: 'a'},                    'url2': {0: 'b', 1: 'b'},                    'url3': {0: 'c', 1: 'd'}})  df['rating_1(1)'] = [3.0, 1] df['rating_2(1)'] = [2.5, 4.5] df['rating_1(2)'] = [4.0, 3] df['rating_2(2)'] = [1.5, 3.5] df['rating_1(3)'] = [5.0, 1] df['rating_2(3)'] = [5.0, 3.5] df['description(1)'] = ['sunny day','sunshine'] df['description(2)'] = ['rainy day','rain'] df['description(3)'] = ['foggy day','snow'] 

you can first find columns each category str.contains , use not known pd.lreshape. last aggregate columns mean, var, count , join:

#select columns each category rat1 = df.columns[df.columns.str.contains(r'rating_1')].tolist() print rat1 ['rating_1(1)', 'rating_1(2)', 'rating_1(3)']  rat2 = df.columns[df.columns.str.contains(r'rating_2')].tolist() url = df.columns[df.columns.str.contains(r'url')].tolist() desc = df.columns[df.columns.str.contains(r'description')].tolist()  df =  pd.lreshape(df, {'rat1': rat1, 'rat2': rat2,'url': url,'desc': desc}) print df   rater_id url  rat2  rat1       desc 0    '001'     2.5   3.0  sunny day 1    '002'     4.5   1.0   sunshine 2    '001'   b   1.5   4.0  rainy day 3    '002'   b   3.5   3.0       rain 4    '001'   c   5.0   5.0  foggy day 5    '002'   d   3.5   1.0       snow  #aggregate df = df.groupby(['url']).agg({'rat1':['mean', 'var'],                               'rat2':['mean', 'var'],                                'desc': ' '.join,                                'rater_id': 'count'})  #reset multiindex in columns df.columns = ['_'.join(col) col in df.columns.values] 
print df      rater_id_count  rat2_mean  rat2_var  rat1_mean  rat1_var  \ url                                                                              2        3.5       2.0        2.0       2.0    b                 2        2.5       2.0        3.5       0.5    c                 1        5.0       nan        5.0       nan    d                 1        3.5       nan        1.0       nan                   desc_join   url                          sunny day sunshine   b        rainy day rain   c             foggy day   d                  snow   

Comments