One of the more common tasks of a data scientists is that of data cleaning. Unfortunately, not all datasets come to us ready for analysis out of the box. Some require a little work to be ready, and some require a little more. I recently came across a dataset that was REALLY in need of some serious cleaning. It is a dataset (found here) that catalogued academic studies and their cost to publish them openly. The fields include one or two ID numbers corresponding to each paper, the paper title, journal, publisher, and total cost in GBP of making the study open. Below I’ll go through and annotate my steps to clean (partially) this dataset. I say partially because there are so many unique cases of entries that need cleaning in the Journal and Publisher columns that to really clean requires more work that can be encapsulated in a blog post.

Note: you might argue that I could have done this work in fewer steps/lines of code. Sure, you’re right! I make no claims that this is the most terse or pythonic solution to cleaning the dataset. This post reflects my work in cleaning the dataset as I was doing the work for the first time. Of course, ex post facto, it’s much easier to think of a way to do the same task better.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
df = pd.read_excel("University returns_for_figshare_FINAL.xlsx")
df.head()
PMID/PMCID Publisher Journal title Article title COST (£) charged to Wellcome (inc VAT when charged)
0 PMC3378987\n Elsevier Academy of Nutrition and Dietetics Parent support and parent mediated behaviours ... 2379.54
1 PMCID: PMC3780468 ACS (Amercian Chemical Society) Publications ACS Chemical Biology A Novel Allosteric Inhibitor of the Uridine Di... 1294.59
2 PMCID: PMC3621575 ACS (Amercian Chemical Society) Publications ACS Chemical Biology Chemical proteomic analysis reveals the drugab... 1294.78
3 NaN American Chemical Society ACS Chemical Biology Discovery of β2 Adrenergic Receptor Ligands Us... 947.07
4 PMID: 24015914 PMC3833349 American Chemical Society ACS Chemical Biology Discovery of an allosteric inhibitor binding s... 1267.76
df.columns=["PMCID","Publisher","Journal Title","Article Title","TotalCost"]
df.head()
PMCID Publisher Journal Title Article Title TotalCost
0 PMC3378987\n Elsevier Academy of Nutrition and Dietetics Parent support and parent mediated behaviours ... 2379.54
1 PMCID: PMC3780468 ACS (Amercian Chemical Society) Publications ACS Chemical Biology A Novel Allosteric Inhibitor of the Uridine Di... 1294.59
2 PMCID: PMC3621575 ACS (Amercian Chemical Society) Publications ACS Chemical Biology Chemical proteomic analysis reveals the drugab... 1294.78
3 NaN American Chemical Society ACS Chemical Biology Discovery of β2 Adrenergic Receptor Ligands Us... 947.07
4 PMID: 24015914 PMC3833349 American Chemical Society ACS Chemical Biology Discovery of an allosteric inhibitor binding s... 1267.76
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2127 entries, 0 to 2126
Data columns (total 5 columns):
PMCID            1928 non-null object
Publisher        2127 non-null object
Journal Title    2126 non-null object
Article Title    2127 non-null object
TotalCost        2127 non-null float64
dtypes: float64(1), object(4)
memory usage: 83.2+ KB

Let’s start fixing the PMCID column. The first things I notice just from the head() are newlines actually in the value, “PMCID: “ still there when the column header says what the value is, and null values.

df["PMCID"].fillna("Unknown",inplace=True)
df["PMCID"]
0                          PMC3378987\n
1                     PMCID: PMC3780468
2                     PMCID: PMC3621575
3                               Unknown
4            PMID: 24015914 PMC3833349 
5                         : PMC3805332 
6              PMCID:\n    PMC3656742\n
7                        PMCID: 3584654
8                              23373658
9              PMCID:\n    PMC3727331\n
10                    PMCID: PMC3565438
11                    PMCID: PMC3668577
12                    PMCID: PMC3606566
13                     \n    PMC3498934
14            PMID:22993091 PMC3447403 
15                           PMC3087623
16                          PMC3808818 
17      PMID: 23828613 (July 2013 Epub)
18                           PMC3374517
19                           PMC3549237
20                              Unknown
21                           PMC3661931
22                              3535376
23                           PMC3798121
24         21624095  PMCID: PMC3734623 
25                           In Process
26                             23734913
27                              Unknown
28                              Unknown
29                              Unknown
                     ...               
2097         PMCID:\n    PMC3508281\n\n
2098                         PMC3717178
2099                         PMC3775257
2100                         PMC3558801
2101                            Unknown
2102                         PMC3770928
2103                         PMC3627817
2104             PMCID:\n    PMC3759846
2105             PMCID:\n    PMC3608034
2106                            Unknown
2107                  PMCID: PMC3599165
2108                         PMC3763375
2109                     PMID: 24035434
2110                         PMC3763374
2111                         PMC3404461
2112                 PMCID: PMC3740234 
2113                           23117109
2114                         PMC3757156
2115                            Unknown
2116                         PMC3611597
2117                         PMC3786614
2118                         PMC3716626
2119                  PMCID: PMC3791421
2120                           23562481
2121                         PMC3190389
2122                           23201205
2123                       pub Aug 2013
2124              23200744  PMC3552157 
2125                     PMC3472342\n\n
2126                  PMCID: PMC3600532
Name: PMCID, Length: 2127, dtype: object

OK, it seems from looking at the pubmed website, there are two identifiers, PMID and PMCID. The former is an 8-digit number and the latter is a 7-digit number preceded by “PMC”. The entries in this column that are just 8-digit numbers I’ll assume are PMIDs. I propose to just make this column into two separate columns: PMID and PMCID

def extractNonPMCID(row):
    id = str(row.PMCID)
    if len(id) < 4:
        return id
    elif id[0:3] == "PMC":
        return "Unknown"
    else:
        return id
    
def fixPMCID(id):
    id = str(id)
    if len(id) < 4:
        return "Unknown"
    elif id[0:3] == "PMC":
        return id
    else:
        return "Unknown"
        

df["PMID"] = df.apply(extractNonPMCID,axis=1)
df["PMCID"] = df["PMCID"].apply(fixPMCID)
df["PMCID"]
0                     PMC3378987\n
1                PMCID: PMC3780468
2                PMCID: PMC3621575
3                          Unknown
4                          Unknown
5                          Unknown
6         PMCID:\n    PMC3656742\n
7                   PMCID: 3584654
8                          Unknown
9         PMCID:\n    PMC3727331\n
10               PMCID: PMC3565438
11               PMCID: PMC3668577
12               PMCID: PMC3606566
13                         Unknown
14                         Unknown
15                      PMC3087623
16                     PMC3808818 
17                         Unknown
18                      PMC3374517
19                      PMC3549237
20                         Unknown
21                      PMC3661931
22                         Unknown
23                      PMC3798121
24                         Unknown
25                         Unknown
26                         Unknown
27                         Unknown
28                         Unknown
29                         Unknown
                   ...            
2097    PMCID:\n    PMC3508281\n\n
2098                    PMC3717178
2099                    PMC3775257
2100                    PMC3558801
2101                       Unknown
2102                    PMC3770928
2103                    PMC3627817
2104        PMCID:\n    PMC3759846
2105        PMCID:\n    PMC3608034
2106                       Unknown
2107             PMCID: PMC3599165
2108                    PMC3763375
2109                       Unknown
2110                    PMC3763374
2111                    PMC3404461
2112            PMCID: PMC3740234 
2113                       Unknown
2114                    PMC3757156
2115                       Unknown
2116                    PMC3611597
2117                    PMC3786614
2118                    PMC3716626
2119             PMCID: PMC3791421
2120                       Unknown
2121                    PMC3190389
2122                       Unknown
2123                       Unknown
2124                       Unknown
2125                PMC3472342\n\n
2126             PMCID: PMC3600532
Name: PMCID, Length: 2127, dtype: object
df["PMID"]
0                               Unknown
1                               Unknown
2                               Unknown
3                               Unknown
4            PMID: 24015914 PMC3833349 
5                         : PMC3805332 
6                               Unknown
7                               Unknown
8                              23373658
9                               Unknown
10                              Unknown
11                              Unknown
12                              Unknown
13                     \n    PMC3498934
14            PMID:22993091 PMC3447403 
15                              Unknown
16                              Unknown
17      PMID: 23828613 (July 2013 Epub)
18                              Unknown
19                              Unknown
20                              Unknown
21                              Unknown
22                              3535376
23                              Unknown
24         21624095  PMCID: PMC3734623 
25                           In Process
26                             23734913
27                              Unknown
28                              Unknown
29                              Unknown
                     ...               
2097                            Unknown
2098                            Unknown
2099                            Unknown
2100                            Unknown
2101                            Unknown
2102                            Unknown
2103                            Unknown
2104                            Unknown
2105                            Unknown
2106                            Unknown
2107                            Unknown
2108                            Unknown
2109                     PMID: 24035434
2110                            Unknown
2111                            Unknown
2112                            Unknown
2113                           23117109
2114                            Unknown
2115                            Unknown
2116                            Unknown
2117                            Unknown
2118                            Unknown
2119                            Unknown
2120                           23562481
2121                            Unknown
2122                           23201205
2123                       pub Aug 2013
2124              23200744  PMC3552157 
2125                            Unknown
2126                            Unknown
Name: PMID, Length: 2127, dtype: object

Wow, this really needed cleaning all right! Let’s first get the PMCID column in order, then we’ll attend to the PMID column (perhaps replacing some PMCID values it seems).

def purgeColons(id):
    s = id.split()
    if len(s) < 2:
        return id
    else:
        if s[1][0:3] == "PMC":
            return s[1]
        else:
            return "PMC"+s[1]
df["PMCID"] = df["PMCID"].apply(purgeColons)
# I saw a couple entries with two newlines and with a apsce
df["PMCID"] = df["PMCID"].apply( lambda x : x.strip("\n").strip("\n").strip(" ") )
df["PMCID"]
0       PMC3378987
1       PMC3780468
2       PMC3621575
3          Unknown
4          Unknown
5          Unknown
6       PMC3656742
7       PMC3584654
8          Unknown
9       PMC3727331
10      PMC3565438
11      PMC3668577
12      PMC3606566
13         Unknown
14         Unknown
15      PMC3087623
16      PMC3808818
17         Unknown
18      PMC3374517
19      PMC3549237
20         Unknown
21      PMC3661931
22         Unknown
23      PMC3798121
24         Unknown
25         Unknown
26         Unknown
27         Unknown
28         Unknown
29         Unknown
           ...    
2097    PMC3508281
2098    PMC3717178
2099    PMC3775257
2100    PMC3558801
2101       Unknown
2102    PMC3770928
2103    PMC3627817
2104    PMC3759846
2105    PMC3608034
2106       Unknown
2107    PMC3599165
2108    PMC3763375
2109       Unknown
2110    PMC3763374
2111    PMC3404461
2112    PMC3740234
2113       Unknown
2114    PMC3757156
2115       Unknown
2116    PMC3611597
2117    PMC3786614
2118    PMC3716626
2119    PMC3791421
2120       Unknown
2121    PMC3190389
2122       Unknown
2123       Unknown
2124       Unknown
2125    PMC3472342
2126    PMC3600532
Name: PMCID, Length: 2127, dtype: object

OK let’s see what remains to be corrected in the PMCID column

df.PMCID[(df.PMCID != "Unknown") & ~(df.PMCID.str.startswith("PMC3"))]
111           PMC:3697379
158      PMCID:PMC3542821
235            PMC2843621
357      PMCID:PMC3718323
384                   PMC
392      PMCID:PMC3778892
486                   PMC
613         PMC(Available
739            PMC2881129
831                 PMCin
880         PMCID:3106451
901      PMCID:PMC3268392
963            PMC2757965
1037     PMCID:PMC3708113
1077     PMCID:PMC3778837
1143     PMCID:PMC3759848
1178     PMCID:PMC3728563
1361     PMCID:PMC3709497
1371     PMCID:PMC3572772
1397           PMC2877799
1404               PMCID:
1464           PMC2779337
1533                PMCin
1570    PMCID:PMC3715547)
1638     PMCID:PMC3706442
1639     PMCID:PMC3711905
1778                PMCis
2008     PMCID:PMC3549627
2022         PMCID3135804
2031           PMC2955965
2038     PMCID:PMC3573229
2039     PMCID:PMC3755220
2064     PMCID:PMC3677413
2078     PMCID:PMC3731785
Name: PMCID, dtype: object
def corrPMCIDEdge(id):
    id = id.replace("PMCID:","")
    id = id.replace("PMCID","PMC")
    id = id.replace("PMC:","")
    id = id.replace("PMCin","Unknown")
    id = id.replace("PMCis","Unknown")
    if id == "PMC":
        return "Unknown"
    elif id == "PMC(Available":
        return "Unknown"
    else:
        return id
df["PMCID"]=df["PMCID"].apply(corrPMCIDEdge)
df.PMCID[(df.PMCID != "Unknown") & ~(df.PMCID.str.startswith("PMC3"))]
111        3697379
235     PMC2843621
739     PMC2881129
880        3106451
963     PMC2757965
1397    PMC2877799
1404              
1464    PMC2779337
2031    PMC2955965
Name: PMCID, dtype: object
# Do the rest by hand
df.at[111,"PMCID"] = "PMC3697379"
df.at[880,"PMCID"] = "PMC3106451"
df.at[1404,"PMCID"] = "Unknown"
df["PMCID"] = [str(x) for x in df.PMCID]
df[~(df.PMCID.str.startswith("PMC3")) & (df.PMCID != "Unknown")]["PMCID"]
235     PMC2843621
739     PMC2881129
963     PMC2757965
1397    PMC2877799
1464    PMC2779337
2031    PMC2955965
Name: PMCID, dtype: object

OK PMCID is clean! Let’s now tackle PMID

df.PMID
0                               Unknown
1                               Unknown
2                               Unknown
3                               Unknown
4            PMID: 24015914 PMC3833349 
5                         : PMC3805332 
6                               Unknown
7                               Unknown
8                              23373658
9                               Unknown
10                              Unknown
11                              Unknown
12                              Unknown
13                     \n    PMC3498934
14            PMID:22993091 PMC3447403 
15                              Unknown
16                              Unknown
17      PMID: 23828613 (July 2013 Epub)
18                              Unknown
19                              Unknown
20                              Unknown
21                              Unknown
22                              3535376
23                              Unknown
24         21624095  PMCID: PMC3734623 
25                           In Process
26                             23734913
27                              Unknown
28                              Unknown
29                              Unknown
                     ...               
2097                            Unknown
2098                            Unknown
2099                            Unknown
2100                            Unknown
2101                            Unknown
2102                            Unknown
2103                            Unknown
2104                            Unknown
2105                            Unknown
2106                            Unknown
2107                            Unknown
2108                            Unknown
2109                     PMID: 24035434
2110                            Unknown
2111                            Unknown
2112                            Unknown
2113                           23117109
2114                            Unknown
2115                            Unknown
2116                            Unknown
2117                            Unknown
2118                            Unknown
2119                            Unknown
2120                           23562481
2121                            Unknown
2122                           23201205
2123                       pub Aug 2013
2124              23200744  PMC3552157 
2125                            Unknown
2126                            Unknown
Name: PMID, Length: 2127, dtype: object
df.PMID = [str(x) for x in df.PMID]
df.PMID[df.PMID.str.contains("PMC")]
4              PMID: 24015914 PMC3833349 
5                           : PMC3805332 
13                       \n    PMC3498934
14              PMID:22993091 PMC3447403 
24           21624095  PMCID: PMC3734623 
31                   21948184 PMC3528370 
33             PMID: 23672932 PMC3684112 
54                    22971149 PMC3466778
64             PMID: 23907068 PMC3837358 
73             PMID: 22898127 PMC3830178 
74             PMID: 23159264 PMC3834139 
90                   23670821 PMC3738939 
93               PMID:22961729 PMC3547296
94               PMID:22730171 PMC3556687
95              PMID:21472932 PMC3555362 
107             PMID:23089748 PMC3535899 
108            PMID: 23856774 PMC3811480 
110      PMID:23817378  PMCID: PMC3754341
123                  23740368 PMC3840700 
124                   23460124 PMC3715109
132                   22302008 PMC3405838
137                  23025831 PMC3462035 
138                  23731076 PMC3670620 
139                   23244291 PMC3569044
147             PMID:22326920 PMC3480643 
149            PMID: 20377523 PMC3685215 
150           PMID: 23009366  PMC3685217 
162           PMID: 23356304  PMC3561678 
163           PMID: 23697937  PMC3685327 
164             PMCID:\n    PMC3554041\n 
                      ...                
1919            PMID: 23407782 PMC3676742
1920          PMID: 23299096  PMC3636441 
1921          PMID: 23681165  PMC3824307 
1923           PMID: 23332733 PMC3712184 
1935          22777780 PMCID: PMC3568905 
1950          23422316 PMCID: PMC3710354 
1955           PMID: 22704639 PMC3485564 
1956           PMID: 21764562  PMC3485562
1957                 23230506 PMC3516806 
1969                 23698510  PMC3812900
1993             PMID23287458 PMC3605587 
1996                 23356787 PMC3813989 
1999                   22618994 PMC344470
2010            PMID:23623732 PMC3791409 
2020                23375655  PMC3567274 
2025          PMID: 23282150  PMC3610541 
2030           PMID: 24076655 PMC3817463 
2033    PMID: 23371065 /PMCID: PMC3633812
2046                  23698002 PMC3696650
2049          PMID: 23125445  PMC3532835 
2051                 23576705, PMC3727040
2052          23804098 PMCID: PMC3724994 
2053                  23825402 PMC3528370
2062    PMID: 23041239 /PMCID: PMC3490334
2067    PMID: 23709760 /PMCID: PMC3756442
2070                 22901061 PMC3533787 
2074           PMID: 19948006 PMC3551259 
2080           PMID: 21940062 PMC3814186 
2083            PMID:23265842 PMC3569712 
2124                23200744  PMC3552157 
Name: PMID, Length: 276, dtype: object
pmcs = df.PMID[df.PMID.str.contains("PMC")].apply(lambda x : x.split()[-1])
pmcs[~pmcs.str.startswith("PMC3")]
632                     PMC
633                     PMC
784      (PMCID:PMC3757918)
1068           PMCID3274377
1273            -PMC3839404
1456            -PMC3661980
1537                3679597
1591    24039607/PMC3764205
1687    24124519/PMC3790821
Name: PMID, dtype: object
pmcs.drop(632,axis=0,inplace=True)
pmcs.drop(633,axis=0,inplace=True)
pmcs.at[784]="PMC3757918"
pmcs.at[1068]="PMC3274377"
pmcs.at[1273]="PMC3839404"
pmcs.at[1456]="PMC3661980"
pmcs.at[1537]="PMC3679597"
pmcs.at[1591]="PMC3764205"
pmcs.at[1687]="PMC3790821"
# pmcs.index
for i in pmcs.index:
    print df.PMCID[i],pmcs[i]
Unknown PMC3833349
Unknown PMC3805332
Unknown PMC3498934
Unknown PMC3447403
Unknown PMC3734623
Unknown PMC3528370
Unknown PMC3684112
Unknown PMC3466778
Unknown PMC3837358
Unknown PMC3830178
Unknown PMC3834139
Unknown PMC3738939
Unknown PMC3547296
Unknown PMC3556687
Unknown PMC3555362
Unknown PMC3535899
Unknown PMC3811480
Unknown PMC3754341
Unknown PMC3840700
Unknown PMC3715109
Unknown PMC3405838
Unknown PMC3462035
Unknown PMC3670620
Unknown PMC3569044
Unknown PMC3480643
Unknown PMC3685215
Unknown PMC3685217
Unknown PMC3561678
Unknown PMC3685327
Unknown PMC3554041
Unknown PMC3413243
Unknown PMC3714738
Unknown PMC3630740
Unknown PMC3694669
Unknown PMC3381227
Unknown PMC3736666
Unknown PMC3814418
Unknown PMC3790949
Unknown PMC3653566
Unknown PMC3790942
Unknown PMC3846689
Unknown PMC3612817
Unknown PMC3525062
Unknown PMC3573228
Unknown PMC3596649
Unknown PMC3687655
Unknown PMC3724117
Unknown PMC3730114
Unknown PMC3491872
Unknown PMC3674087
Unknown PMC3619236
Unknown PMC3601332
Unknown PMC3510442
Unknown PMC3689915
Unknown PMC3753470
Unknown PMC3599488
Unknown PMC3599488
Unknown PMC3715701
Unknown PMC3539454
Unknown PMC3625746
Unknown PMC3792637
Unknown PMC3401426
Unknown PMC3569615
Unknown PMC3743159
Unknown PMC3807794
Unknown PMC3566545
Unknown PMC3629559
Unknown PMC3706957
Unknown PMC3502867
Unknown PMC3485558
Unknown PMC3593001
Unknown PMC3521961
Unknown PMC3709123
Unknown PMC3381227
Unknown PMC3778978
Unknown PMC3849851
Unknown PMC3638368
Unknown PMC3427858
Unknown PMC3629561
Unknown PMC3744751
Unknown PMC3630358
Unknown PMC3791366
Unknown PMC3701237
Unknown PMC3707360
Unknown PMC3669511
Unknown PMC3547489
Unknown PMC3657127
Unknown PMC3781703
Unknown PMC3744815
Unknown PMC3784797
Unknown PMC3745826
Unknown PMC3630327
Unknown PMC3659828
Unknown PMC3572582
Unknown PMC3564407
Unknown PMC3734648
Unknown PMC3806363
Unknown PMC3642154
Unknown PMC3668512
Unknown PMC3654214
Unknown PMC3693486
Unknown PMC3696884
Unknown PMC3664024
Unknown PMC3638342
Unknown PMC3487588
Unknown PMC3591252
Unknown PMC3727344
Unknown PMC3744754
Unknown PMC3757918
Unknown PMC3652415
Unknown PMC3842180
Unknown PMC3710569
Unknown PMC3824577
Unknown PMC3497410
Unknown PMC3536122
Unknown PMC3631001
Unknown PMC3819985
Unknown PMC3566929
Unknown PMC3821636
Unknown PMC3734734
Unknown PMC3504370
Unknown PMC3634610
Unknown PMC3730248
Unknown PMC3672848
Unknown PMC3506128
Unknown PMC3646402
Unknown PMC3601604
Unknown PMC3797641
Unknown PMC3561570
Unknown PMC3493908
Unknown PMC3543027
Unknown PMC3585060
Unknown PMC3711325
Unknown PMC3531760
Unknown PMC3516716
Unknown PMC3591641
Unknown PMC3668706
Unknown PMC3750171
Unknown PMC3576117
Unknown PMC3662417
Unknown PMC3746124
Unknown PMC3828593
Unknown PMC3562487
Unknown PMC3651585
Unknown PMC3784648
Unknown PMC3813945
Unknown PMC3708032
Unknown PMC3851688
Unknown PMC3756433
Unknown PMC3576129
Unknown PMC3749055
Unknown PMC3709586
Unknown PMC3709636
Unknown PMC3709573
Unknown PMC3709587
Unknown PMC3542733
Unknown PMC3709688
Unknown PMC3607399
Unknown PMC3619528
Unknown PMC3540259
Unknown PMC3654748
Unknown PMC3274377
Unknown PMC3513931
Unknown PMC3798117
Unknown PMC3579312
Unknown PMC3709824
Unknown PMC3593212
Unknown PMC3535724
Unknown PMC3664183
Unknown PMC3544867
Unknown PMC3664272
Unknown PMC3601669
Unknown PMC3593678
Unknown PMC3586169
Unknown PMC3754066
Unknown PMC3554137
Unknown PMC3457148
Unknown PMC3807398
Unknown PMC3807390
Unknown PMC3641668
Unknown PMC3677138
Unknown PMC3691584
Unknown PMC3440593
Unknown PMC3520108
Unknown PMC3839404
Unknown PMC3836489
Unknown PMC3495574
Unknown PMC3581773
Unknown PMC3820028
Unknown PMC3625108
Unknown PMC3587388
Unknown PMC3529057
Unknown PMC3528594
Unknown PMC3826647
Unknown PMC3514496
Unknown PMC3685308
Unknown PMC3782194
Unknown PMC3532594
Unknown PMC3757163
Unknown PMC3707011
Unknown PMC3480648
Unknown PMC3555187
Unknown PMC3580290
Unknown PMC3542422
Unknown PMC3611599
Unknown PMC3661980
Unknown PMC3675483
Unknown PMC3753647
Unknown PMC3401426
Unknown PMC3643581
Unknown PMC3834809
Unknown PMC3245117
Unknown PMC3492912
Unknown PMC3798095
Unknown PMC3717798
Unknown PMC3809720
Unknown PMC3679597
Unknown PMC3764205
Unknown PMC3542187
Unknown PMC3541403
Unknown PMC3558488
Unknown PMC3487758
Unknown PMC3790821
Unknown PMC3733773
Unknown PMC3716603
Unknown PMC3689675
Unknown PMC3556037
Unknown PMC3491040
Unknown PMC3519623
Unknown PMC3632543
Unknown PMC3686795
Unknown PMC3762823
Unknown PMC3576415
Unknown PMC3585132
Unknown PMC3596319
Unknown PMC3841754
Unknown PMC3712186
Unknown PMC3824070
Unknown PMC3465389
Unknown PMC3529034
Unknown PMC3562850
Unknown PMC3568321
Unknown PMC3581986
Unknown PMC3569869
Unknown PMC3676742
Unknown PMC3636441
Unknown PMC3824307
Unknown PMC3712184
Unknown PMC3568905
Unknown PMC3710354
Unknown PMC3485564
Unknown PMC3485562
Unknown PMC3516806
Unknown PMC3812900
Unknown PMC3605587
Unknown PMC3813989
Unknown PMC344470
Unknown PMC3791409
Unknown PMC3567274
Unknown PMC3610541
Unknown PMC3817463
Unknown PMC3633812
Unknown PMC3696650
Unknown PMC3532835
Unknown PMC3727040
Unknown PMC3724994
Unknown PMC3528370
Unknown PMC3490334
Unknown PMC3756442
Unknown PMC3533787
Unknown PMC3551259
Unknown PMC3814186
Unknown PMC3569712
Unknown PMC3552157
for i in pmcs.index:
    df.at[i,"PMCID"]=pmcs[i]

We were able to find PMCIDs for lots of the Unknowns using the PMID entries!

df.PMID[df.PMID != "Unknown"]
4              PMID: 24015914 PMC3833349 
5                           : PMC3805332 
8                                23373658
13                       \n    PMC3498934
14              PMID:22993091 PMC3447403 
17        PMID: 23828613 (July 2013 Epub)
22                                3535376
24           21624095  PMCID: PMC3734623 
25                             In Process
26                               23734913
31                   21948184 PMC3528370 
32                               23340916
33             PMID: 23672932 PMC3684112 
35            print published August 2013
37                                3748854
39                                3633780
40                             In Process
51                               23514390
52                                3516067
53                               23599091
54                    22971149 PMC3466778
61                                3813311
64             PMID: 23907068 PMC3837358 
69                                3567269
71                    published Sept 2013
72                                3947303
73             PMID: 22898127 PMC3830178 
74             PMID: 23159264 PMC3834139 
75                                3708126
77                PMID: 23409903 23409903
                      ...                
2018                       PMID: 23649934
2020                23375655  PMC3567274 
2025          PMID: 23282150  PMC3610541 
2030           PMID: 24076655 PMC3817463 
2033    PMID: 23371065 /PMCID: PMC3633812
2036                             23253866
2037                    Not yet available
2046                  23698002 PMC3696650
2049          PMID: 23125445  PMC3532835 
2051                 23576705, PMC3727040
2052          23804098 PMCID: PMC3724994 
2053                  23825402 PMC3528370
2062    PMID: 23041239 /PMCID: PMC3490334
2063                      PMID : 23773811
2067    PMID: 23709760 /PMCID: PMC3756442
2070                 22901061 PMC3533787 
2073                              3744763
2074           PMID: 19948006 PMC3551259 
2075                       PMID: 23305527
2079                         pub Aug 2013
2080           PMID: 21940062 PMC3814186 
2083            PMID:23265842 PMC3569712 
2086                       print in press
2096                             22364555
2109                       PMID: 24035434
2113                             23117109
2120                             23562481
2122                             23201205
2123                         pub Aug 2013
2124                23200744  PMC3552157 
Name: PMID, Length: 741, dtype: object

Whoa, there are lots of different cases to consider…Looks like one of the most common is

PMID: XXXXXXXX PMCYYYYYYY

Let’s tackle those

df.PMID[df.PMID.str.contains("PMC")] = df.PMID[df.PMID.str.contains("PMC")].apply(lambda x : x.replace("/","")) #drop slashes

def case1(id):
    s = id.split()
    if len(s) != 3:
        return id
    elif s[0] == "PMID:" and s[2][0:3]=="PMC":
#             print s[1]
            return s[1]
    else:
        return id
df["PMID"]=df.PMID.apply(case1)
/Users/skaplan/Library/Python/2.7/lib/python/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.

Now let’s deal with the case

XXXXXXXX PMCYYYYYYY
df.PMID[df.PMID.str.contains("PMC")]
5                           : PMC3805332 
13                       \n    PMC3498934
14              PMID:22993091 PMC3447403 
24           21624095  PMCID: PMC3734623 
31                   21948184 PMC3528370 
54                    22971149 PMC3466778
90                   23670821 PMC3738939 
93               PMID:22961729 PMC3547296
94               PMID:22730171 PMC3556687
95              PMID:21472932 PMC3555362 
107             PMID:23089748 PMC3535899 
110      PMID:23817378  PMCID: PMC3754341
123                  23740368 PMC3840700 
124                   23460124 PMC3715109
132                   22302008 PMC3405838
137                  23025831 PMC3462035 
138                  23731076 PMC3670620 
139                   23244291 PMC3569044
147             PMID:22326920 PMC3480643 
164             PMCID:\n    PMC3554041\n 
165                            PMC3413243
171                  23458425 PMC3630740 
205                   22738332 PMC3381227
211           23931322 PMCID: PMC3736666 
212                  23877967 PMC3814418 
227                  23002116 PMC3790949 
229                  23255554 PMC3653566 
230                  23043070 PMC3790942 
321                  23246923 PMC3573228 
338           23121507 PMCID: PMC3596649 
                      ...                
1785     PMID: 24023766 PMCID: PMC3762823
1793                     \n    PMC3576415
1825     PMID: 23468629 PMCID: PMC3585132
1841                         \nPMC3596319
1854                 23924687 PMC3841754 
1856    PMID: 23375993 PMCID: PMC3712186 
1863                   PMCID: PMC3465389 
1880                 23213245 PMC3529034 
1885                  23341602 PMC3568321
1892             PMID:23391734 PMC3581986
1907            PMID:22997008 PMC3569869 
1935          22777780 PMCID: PMC3568905 
1950          23422316 PMCID: PMC3710354 
1957                 23230506 PMC3516806 
1969                 23698510  PMC3812900
1993             PMID23287458 PMC3605587 
1996                 23356787 PMC3813989 
1999                   22618994 PMC344470
2010            PMID:23623732 PMC3791409 
2020                23375655  PMC3567274 
2033     PMID: 23371065 PMCID: PMC3633812
2046                  23698002 PMC3696650
2051                 23576705, PMC3727040
2052          23804098 PMCID: PMC3724994 
2053                  23825402 PMC3528370
2062     PMID: 23041239 PMCID: PMC3490334
2067     PMID: 23709760 PMCID: PMC3756442
2070                 22901061 PMC3533787 
2083            PMID:23265842 PMC3569712 
2124                23200744  PMC3552157 
Name: PMID, Length: 194, dtype: object
def case2(id):
    s = id.split()
    if len(s) != 2:
        return id
    elif len(s[0]) == 8 and s[1][0:3]=="PMC":
#             print s[0],s[1]
            return s[0]
    else:
        return id
def isCase2(id):
    s = id.split()
    if len(s) != 2:
        return False
    elif len(s[0]) == 8 and s[1][0:3]=="PMC":
            return True
    else:
        return False

df["isCase2"]=df.PMID.apply(isCase2)
pmcids = df[df.isCase2]["PMID"].apply(lambda x : x.split()[1])
df["PMID"] = df.PMID.apply(case2)
for i in pmcids.index:
    df.at[i,"PMCID"] = pmcids[i]
df.drop("isCase2",axis=1,inplace=True)
df.PMID[df.PMID.str.contains("PMC")]
5                           : PMC3805332 
13                       \n    PMC3498934
14              PMID:22993091 PMC3447403 
24           21624095  PMCID: PMC3734623 
93               PMID:22961729 PMC3547296
94               PMID:22730171 PMC3556687
95              PMID:21472932 PMC3555362 
107             PMID:23089748 PMC3535899 
110      PMID:23817378  PMCID: PMC3754341
147             PMID:22326920 PMC3480643 
164             PMCID:\n    PMC3554041\n 
165                            PMC3413243
211           23931322 PMCID: PMC3736666 
338           23121507 PMCID: PMC3596649 
422               PMID22345357 PMC3539454
425             PMID:23483642 PMC3625746 
426          23418011  PMCID: PMC3792637 
431           22996943 PMCID: PMC3569615 
434           23495205685.88 PMC3743159  
468             PMID:23143153 PMC3566545 
562                           PMC3791366 
596              PMID:22928509 PMC3707360
632         Monograph chapter, not in PMC
633         Monograph chapter, not in PMC
642              PMID:23220237 PMC3630327
758             PMID:22419580 PMC3487588 
784                    (PMCID:PMC3757918)
793               PMID23396536 PMC3652415
830             PMID:22988015 PMC3497410 
834             PMID:23115042 PMC3536122 
                      ...                
1502            PMID:23154538 PMC3717798 
1537                          PMC 3679597
1591                   24039607PMC3764205
1612          23326615 PMCID: PMC3542187 
1672                        : PMC3487758 
1687                   24124519PMC3790821
1717    PMID: 23940704 PMCID: PMC3733773 
1718     PMID: 23894442 PMCID: PMC3716603
1756                   PMCID: PMC3491040 
1762             PMID23251490 PMC3519623 
1766     PMID:23613774  PMCID: PMC3632543
1770               23840430 : PMC3686795 
1785     PMID: 24023766 PMCID: PMC3762823
1793                     \n    PMC3576415
1825     PMID: 23468629 PMCID: PMC3585132
1841                         \nPMC3596319
1856    PMID: 23375993 PMCID: PMC3712186 
1863                   PMCID: PMC3465389 
1892             PMID:23391734 PMC3581986
1907            PMID:22997008 PMC3569869 
1935          22777780 PMCID: PMC3568905 
1950          23422316 PMCID: PMC3710354 
1993             PMID23287458 PMC3605587 
2010            PMID:23623732 PMC3791409 
2033     PMID: 23371065 PMCID: PMC3633812
2051                 23576705, PMC3727040
2052          23804098 PMCID: PMC3724994 
2062     PMID: 23041239 PMCID: PMC3490334
2067     PMID: 23709760 PMCID: PMC3756442
2083            PMID:23265842 PMC3569712 
Name: PMID, Length: 103, dtype: object

Another pattern seems to be

PMID:XXXXXXXX PMCYYYYYYY
def case3(id):
    s = id.split()
    if len(s) != 2:
        return id
    elif (len(s[0]) == 12 or len(s[0]) == 13)  and s[1][0:3]=="PMC":
#             print s[0],s[1]
            return s[0][5:]
    else:
        return id
def isCase3(id):
    s = id.split()
    if len(s) != 2:
        return False
    elif len(s[0]) == 13 and s[1][0:3]=="PMC":
            return True
    else:
        return False

df["isCase3"] = df.PMID.apply(isCase3)
pmcids = df[df.isCase3]["PMID"].apply(lambda x : x.split()[1])
pmcids.at[1068] = pmcids.at[1068].replace("ID","")
df["PMID"] = df.PMID.apply(case3)
for i in pmcids.index:
    df.at[i,"PMCID"] = pmcids[i]
df.drop("isCase3",axis=1,inplace=True)
df.PMID[df.PMID.str.contains("PMC")]
5                           : PMC3805332 
13                       \n    PMC3498934
24           21624095  PMCID: PMC3734623 
110      PMID:23817378  PMCID: PMC3754341
164             PMCID:\n    PMC3554041\n 
165                            PMC3413243
211           23931322 PMCID: PMC3736666 
338           23121507 PMCID: PMC3596649 
426          23418011  PMCID: PMC3792637 
431           22996943 PMCID: PMC3569615 
434           23495205685.88 PMC3743159  
562                           PMC3791366 
632         Monograph chapter, not in PMC
633         Monograph chapter, not in PMC
784                    (PMCID:PMC3757918)
868      PMID: 23209475 PMCID: PMC3504370
872            PMID:23570314 - PMC3634610
884                23319734 : PMC3672848 
922      PMID: 23788479 PMCID: PMC3797641
958                         : PMC3711325 
960                            PMC3516716
972                   2339840 PMC3591641 
986                    PMCID: PMC3750171 
990           23704797 PMCID: PMC3662417 
1016     PMID: 23846817 PMCID: PMC3784648
1017         PMID: 23783094 : PMC3813945 
1050          23324467 PMCID: PMC3709587 
1062         23455506  PMCID: PMC3607399 
1063          PMID:23475217 - PMC3619528 
1064      PMID:23046967 PMCID: PMC3540259
                      ...                
1273            PMID:23518266 -PMC3839404
1287    PMID:23769710  PMCID: PMC3820028 
1309         23256604  PMCID: PMC3587388 
1456            PMID:23500098 -PMC3661980
1478          23595147 PMCID: PMC3675483 
1479     PMID: 23771140 PMCID: PMC3753647
1491     PMID: 21948792 PMCID: PMC3245117
1537                          PMC 3679597
1591                   24039607PMC3764205
1612          23326615 PMCID: PMC3542187 
1672                        : PMC3487758 
1687                   24124519PMC3790821
1717    PMID: 23940704 PMCID: PMC3733773 
1718     PMID: 23894442 PMCID: PMC3716603
1756                   PMCID: PMC3491040 
1766     PMID:23613774  PMCID: PMC3632543
1770               23840430 : PMC3686795 
1785     PMID: 24023766 PMCID: PMC3762823
1793                     \n    PMC3576415
1825     PMID: 23468629 PMCID: PMC3585132
1841                         \nPMC3596319
1856    PMID: 23375993 PMCID: PMC3712186 
1863                   PMCID: PMC3465389 
1935          22777780 PMCID: PMC3568905 
1950          23422316 PMCID: PMC3710354 
2033     PMID: 23371065 PMCID: PMC3633812
2051                 23576705, PMC3727040
2052          23804098 PMCID: PMC3724994 
2062     PMID: 23041239 PMCID: PMC3490334
2067     PMID: 23709760 PMCID: PMC3756442
Name: PMID, Length: 67, dtype: object
def case4(id):
    s = id.split()
    if len(s) != 4:
        return id
    elif s[0]=="PMID:"  and s[2]=="PMCID:":
#             print s[0],s[1]
            return s[1]
    else:
        return id
def isCase4(id):
    s = id.split()
    if len(s) != 4:
        return False
    elif s[0]=="PMID:"  and s[2]=="PMCID:":
            return True
    else:
        return False

df["isCase4"] = df.PMID.apply(isCase4)
pmcids = df[df.isCase4]["PMID"].apply(lambda x : x.split()[-1])
df["PMID"] = df.PMID.apply(case4)
for i in pmcids.index:
    df.at[i,"PMCID"] = pmcids[i]
df.drop("isCase4",axis=1,inplace=True)
df.PMID[df.PMID.str.contains("PMC")]
5                           : PMC3805332 
13                       \n    PMC3498934
24           21624095  PMCID: PMC3734623 
110      PMID:23817378  PMCID: PMC3754341
164             PMCID:\n    PMC3554041\n 
165                            PMC3413243
211           23931322 PMCID: PMC3736666 
338           23121507 PMCID: PMC3596649 
426          23418011  PMCID: PMC3792637 
431           22996943 PMCID: PMC3569615 
434           23495205685.88 PMC3743159  
562                           PMC3791366 
632         Monograph chapter, not in PMC
633         Monograph chapter, not in PMC
784                    (PMCID:PMC3757918)
872            PMID:23570314 - PMC3634610
884                23319734 : PMC3672848 
958                         : PMC3711325 
960                            PMC3516716
972                   2339840 PMC3591641 
986                    PMCID: PMC3750171 
990           23704797 PMCID: PMC3662417 
1017         PMID: 23783094 : PMC3813945 
1050          23324467 PMCID: PMC3709587 
1062         23455506  PMCID: PMC3607399 
1063          PMID:23475217 - PMC3619528 
1064      PMID:23046967 PMCID: PMC3540259
1098         23318955  PMCID: PMC3593212 
1105               23474851 : PMC3664183 
1132               23447590 : PMC3601669 
1211          22875964 PMCID: PMC3457148 
1218               23567253 : PMC3641668 
1243                         \nPMC3691584
1273            PMID:23518266 -PMC3839404
1287    PMID:23769710  PMCID: PMC3820028 
1309         23256604  PMCID: PMC3587388 
1456            PMID:23500098 -PMC3661980
1478          23595147 PMCID: PMC3675483 
1537                          PMC 3679597
1591                   24039607PMC3764205
1612          23326615 PMCID: PMC3542187 
1672                        : PMC3487758 
1687                   24124519PMC3790821
1756                   PMCID: PMC3491040 
1766     PMID:23613774  PMCID: PMC3632543
1770               23840430 : PMC3686795 
1793                     \n    PMC3576415
1841                         \nPMC3596319
1863                   PMCID: PMC3465389 
1935          22777780 PMCID: PMC3568905 
1950          23422316 PMCID: PMC3710354 
2051                 23576705, PMC3727040
2052          23804098 PMCID: PMC3724994 
Name: PMID, dtype: object
def case5(id):
    s = id.split()
    if len(s) != 3:
        return id
    elif s[0][0:5]=="PMID:"  and s[1]=="PMCID:":
#             print s[0],s[1]
            return s[0][5:]
    else:
        return id
def isCase5(id):
    s = id.split()
    if len(s) != 3:
        return False
    elif len(s[0]) == 13 and s[1][0:3]=="PMC":
            return True
    else:
        return False

df["isCase5"] = df.PMID.apply(isCase5)
pmcids = df[df.isCase5]["PMID"].apply(lambda x : x.split()[-1])
df["PMID"] = df.PMID.apply(case5)
for i in pmcids.index:
    df.at[i,"PMCID"] = pmcids[i]
df.drop("isCase5",axis=1,inplace=True)
df.PMID[df.PMID.str.contains("PMC")]
5                       : PMC3805332 
13                   \n    PMC3498934
24       21624095  PMCID: PMC3734623 
164         PMCID:\n    PMC3554041\n 
165                        PMC3413243
211       23931322 PMCID: PMC3736666 
338       23121507 PMCID: PMC3596649 
426      23418011  PMCID: PMC3792637 
431       22996943 PMCID: PMC3569615 
434       23495205685.88 PMC3743159  
562                       PMC3791366 
632     Monograph chapter, not in PMC
633     Monograph chapter, not in PMC
784                (PMCID:PMC3757918)
872        PMID:23570314 - PMC3634610
884            23319734 : PMC3672848 
958                     : PMC3711325 
960                        PMC3516716
972               2339840 PMC3591641 
986                PMCID: PMC3750171 
990       23704797 PMCID: PMC3662417 
1017     PMID: 23783094 : PMC3813945 
1050      23324467 PMCID: PMC3709587 
1062     23455506  PMCID: PMC3607399 
1063      PMID:23475217 - PMC3619528 
1098     23318955  PMCID: PMC3593212 
1105           23474851 : PMC3664183 
1132           23447590 : PMC3601669 
1211      22875964 PMCID: PMC3457148 
1218           23567253 : PMC3641668 
1243                     \nPMC3691584
1273        PMID:23518266 -PMC3839404
1309     23256604  PMCID: PMC3587388 
1456        PMID:23500098 -PMC3661980
1478      23595147 PMCID: PMC3675483 
1537                      PMC 3679597
1591               24039607PMC3764205
1612      23326615 PMCID: PMC3542187 
1672                    : PMC3487758 
1687               24124519PMC3790821
1756               PMCID: PMC3491040 
1770           23840430 : PMC3686795 
1793                 \n    PMC3576415
1841                     \nPMC3596319
1863               PMCID: PMC3465389 
1935      22777780 PMCID: PMC3568905 
1950      23422316 PMCID: PMC3710354 
2051             23576705, PMC3727040
2052      23804098 PMCID: PMC3724994 
Name: PMID, dtype: object
def case6(id):
    s = id.split()
    if len(s) != 3:
        return id
    elif len(s[0])==8  and s[1]=="PMCID:":
#             print s[0],s[1]
            return s[0]
    else:
        return id
def isCase6(id):
    s = id.split()
    if len(s) != 3:
        return False
    elif len(s[0])==8  and s[1]=="PMCID:":
#             print s[0],s[1]
            return True
    else:
        return False

df["isCase6"] = df.PMID.apply(isCase6)
pmcids = df[df.isCase6]["PMID"].apply(lambda x : x.split()[-1])
df["PMID"] = df.PMID.apply(case6)
for i in pmcids.index:
    df.at[i,"PMCID"] = pmcids[i]
df.drop("isCase6",axis=1,inplace=True)
df.PMID[df.PMID.str.contains("PMC")]
5                       : PMC3805332 
13                   \n    PMC3498934
164         PMCID:\n    PMC3554041\n 
165                        PMC3413243
434       23495205685.88 PMC3743159  
562                       PMC3791366 
632     Monograph chapter, not in PMC
633     Monograph chapter, not in PMC
784                (PMCID:PMC3757918)
872        PMID:23570314 - PMC3634610
884            23319734 : PMC3672848 
958                     : PMC3711325 
960                        PMC3516716
972               2339840 PMC3591641 
986                PMCID: PMC3750171 
1017     PMID: 23783094 : PMC3813945 
1063      PMID:23475217 - PMC3619528 
1105           23474851 : PMC3664183 
1132           23447590 : PMC3601669 
1218           23567253 : PMC3641668 
1243                     \nPMC3691584
1273        PMID:23518266 -PMC3839404
1456        PMID:23500098 -PMC3661980
1537                      PMC 3679597
1591               24039607PMC3764205
1672                    : PMC3487758 
1687               24124519PMC3790821
1756               PMCID: PMC3491040 
1770           23840430 : PMC3686795 
1793                 \n    PMC3576415
1841                     \nPMC3596319
1863               PMCID: PMC3465389 
2051             23576705, PMC3727040
Name: PMID, dtype: object

Just do the rest by hand

df.at[5,"PMID"] = "Unknown"
df.at[5,"PMCID"] = "PMC3805332"
df.at[13,"PMID"] = "Unknown"
df.at[13,"PMCID"] = "PMC3498934"
df.at[164,"PMID"] = "Unknown"
df.at[164,"PMCID"] = "PMC3554041"
df.at[165,"PMID"] = "Unknown"
df.at[165,"PMCID"] = "PMC3413243"
df.at[434,"PMID"] = "23495205" # looked in pubmed
df.at[434,"PMCID"] = "PMC3743159"
df.at[562,"PMID"] = "Unknown"
df.at[562,"PMCID"] = "PMC3791366"
df.at[632,"PMID"] = "Unknown"
df.at[632,"PMCID"] = "Unknown"
df.at[633,"PMID"] = "Unknown"
df.at[633,"PMCID"] = "Unknown"
df.at[784,"PMID"] = "Unknown"
df.at[784,"PMCID"] = "PMC3757918"
df.at[872,"PMID"] = "23570314"
df.at[872,"PMCID"] = "PMC3634610"
df.at[884,"PMID"] = "23319734"
df.at[884,"PMCID"] = "PMC3672848"
df.at[958,"PMID"] = "Unknown"
df.at[958,"PMCID"] = "PMC3711325"
df.at[960,"PMID"] = "Unknown"
df.at[960,"PMCID"] = "PMC3516716"
df.at[972,"PMID"] = "23329840" # looked in pubmed, the 2 was missing!
df.at[972,"PMCID"] = "PMC3591641"
df.at[986,"PMID"] = "Unknown"
df.at[986,"PMCID"] = "PMC3750171"
df.at[1017,"PMID"] = "23783094"
df.at[1017,"PMCID"] = "PMC3813945"
df.at[1063,"PMID"] = "23475217"
df.at[1063,"PMCID"] = "PMC3619528"
df.at[1105,"PMID"] = "23474851"
df.at[1105,"PMCID"] = "PMC3664183"
df.at[1132,"PMID"] = "23447590"
df.at[1132,"PMCID"] = "PMC3601669"
df.at[1218,"PMID"] = "23567253"
df.at[1218,"PMCID"] = "PMC3641668" 
df.at[1243,"PMID"] = "Unknown"
df.at[1243,"PMCID"] = "PMC3691584"
df.at[1273,"PMID"] = "23518266"
df.at[1273,"PMCID"] = "PMC3839404"
df.at[1456,"PMID"] = "23500098"
df.at[1456,"PMCID"] = "PMC3661980"
df.at[1537,"PMID"] = "Unknown"
df.at[1537,"PMCID"] = "PMC3679597"
df.at[1591,"PMID"] = "24039607"
df.at[1591,"PMCID"] = "PMC3764205"
df.at[1672,"PMID"] = "Unknown"
df.at[1672,"PMCID"] = "PMC3487758"
df.at[1687,"PMID"] = "24124519"
df.at[1687,"PMCID"] = "PMC3790821"
df.at[1756,"PMID"] = "Unknown"
df.at[1756,"PMCID"] = "PMC3491040"
df.at[1770,"PMID"] = "23840430"
df.at[1770,"PMCID"] = "PMC3686795"
df.at[1793,"PMID"] = "Unknown"
df.at[1793,"PMCID"] = "PMC3576415"
df.at[1841,"PMID"] = "Unknown"
df.at[1841,"PMCID"] = "PMC3596319"
df.at[1863,"PMID"] = "Unknown"
df.at[1863,"PMCID"] = "PMC3465389"
df.at[2051,"PMID"] = "23576705"
df.at[2051,"PMCID"] = "PMC3727040"
df.PMID[df.PMID.str.contains(":")]
17        PMID: 23828613 (July 2013 Epub)
77                PMID: 23409903 23409903
78                          PMID:23331923
114                         PMID:23706023
125                         PMID:23971731
166                       PMID: 23459248 
187                        PMID: 23245760
188                        PMID: 23452663
189                        PMID: 23510580
234                         PMID:23442822
327                        PMID: 23822584
334                        PMID: 23927070
349                         PMID:23137731
367                        PMID: 23966160
369                         PMID:23974100
387                        PMID: 23994477
394               PMID: 23041313 23041313
413                         PMID:24047602
424                         PMID:24009110
432                        PMID: 23918316
495                        PMID: 23267662
499                        PMID: 23935057
501                         PMID:24118397
513                        PMID: 23861059
552                        PMID: 22171404
648                        PMID: 23879611
654                       PMID: 23118027 
670                       PMID : 23763300
716                        PMID: 23256812
717                        PMID: 24024949
                      ...                
1229                        PMID:23545320
1272                       PMID: 23494299
1306                        PMID:23182425
1307                        PMID:23200636
1312                        PMID:24125554
1314                       PMID: 23980694
1315                       PMID: 23750903
1316                       PMID: 23888912
1374                        PMID:23639206
1381                       PMID: 18755526
1442                       PMID: 23295962
1446                        PMID:24056298
1475                        PMID:24038356
1498                       PMID: 23686316
1499                       PMID: 23708655
1500                        PMID:23975431
1527    PMID: 23931634 Aug 2013 Epub date
1541                        PMID:23023652
1543                        PMID:23898885
1678                        PMID:23300533
1816                        PMID:23633946
1857                       PMID: 22964003
1883              PMID: 23650371 23650371
1948                        PMID:23635806
1953                       PMID: 23236081
2001                       PMID: 23703895
2018                       PMID: 23649934
2063                      PMID : 23773811
2075                       PMID: 23305527
2109                       PMID: 24035434
Name: PMID, Length: 85, dtype: object
def removeColon(id):
    if len(id)==13 and id[0:5]=="PMID:":
        return id[6:]
    else:
        return id
df["PMID"] = df["PMID"].apply(removeColon)
def removeColonSpace(id):
    if len(id)==14 and id[0:6]=="PMID: ":
        return id[7:]
    else:
        return id
df["PMID"] = df["PMID"].apply(removeColonSpace)
df.PMID[df.PMID.str.contains(":")]
17        PMID: 23828613 (July 2013 Epub)
77                PMID: 23409903 23409903
166                       PMID: 23459248 
394               PMID: 23041313 23041313
654                       PMID: 23118027 
670                       PMID : 23763300
741        PMID: 23703681 (May 2013 Epub)
907      PMID: 23846567  (Epub July 2013)
1036                      PMID: 22360292 
1067                      PMID:  23945372
1069       PMID:24048963 (Epub Sept 2013)
1076       PMID: 23962810  Aug 2013 Epub 
1527    PMID: 23931634 Aug 2013 Epub date
1883              PMID: 23650371 23650371
2063                      PMID : 23773811
Name: PMID, dtype: object
df.at[17,"PMID"] = "23828613"
df.at[77,"PMID"] = "23409903"
df.at[166,"PMID"] = "23459248"
df.at[394,"PMID"] = "23041313"
df.at[654,"PMID"] = "23118027"
df.at[670,"PMID"] = "23763300"
df.at[741,"PMID"] = "23703681"
df.at[907,"PMID"] = "23846567"
df.at[1036,"PMID"] = "22360292"
df.at[1067,"PMID"] = "23945372"
df.at[1069,"PMID"] = "24048963"
df.at[1076,"PMID"] = "23962810"
df.at[1527,"PMID"] = "23931634"
df.at[1883,"PMID"] = "23650371"
df.at[2063,"PMID"] = "23773811"
df.PMID[ (df.PMID.str.len()!=8) & (df.PMID != "Unknown") ]
22                                                3535376
25                                             In Process
35                            print published August 2013
37                                                3748854
39                                                3633780
40                                             In Process
52                                                3516067
61                                                3813311
69                                                3567269
71                                    published Sept 2013
72                                                3947303
75                                                3708126
78                                                3331923
114                                               3706023
116                                     Not yet available
125                                               3971731
127                                            In Process
148                                               3685266
172                                               3654172
176                                               3847282
187                                               3245760
188                                               3452663
189                                               3510580
198                                               3558749
201                                               3744075
234                                               3442822
237                                               3526451
238                                               3681581
239                                               3708772
244                                               3639180
                              ...                        
1911                                               e-pub 
1912                             in press - due August 13
1929                                               543219
1933                                         PMID23824939
1934                                    Not yet available
1944                                              3281495
1945                                              3651613
1948                                              3635806
1953                                              3236081
1958                                                    -
1974                     Epub ahead of print - April 2013
1975    Epub ahead of print pub Dec 2012, print in pre...
1988                                              3778404
1993                                              3287458
1995                                         PMID23278317
1997                                    Not yet available
2000                                   22897899 MC3489778
2001                                              3703895
2003                                              3223522
2004                                              3813980
2016                                              3635120
2017                                                    -
2018                                              3649934
2037                                    Not yet available
2073                                              3744763
2075                                              3305527
2079                                         pub Aug 2013
2086                                       print in press
2109                                              4035434
2123                                         pub Aug 2013
Name: PMID, Length: 350, dtype: object
df.PMID[ (df.PMID.str.len()!=8) & (df.PMID != "Unknown") & (df.PMID.str.startswith("PMID"))]
1263    PMID24069883
1264    PMID24069886
1265    PMID24069913
1521    PMID23790101
1707    PMID22970197
1906    PMID23184293
1933    PMID23824939
1995    PMID23278317
Name: PMID, dtype: object
def stripPMID(id):
    if len(id)==12 and id[0:4]=="PMID":
        return id[5:]
    else:
        return id
df["PMID"] = df["PMID"].apply(stripPMID)
df.PMID[ (df.PMID.str.len()!=8) & (df.PMID != "Unknown")]
22                                                3535376
25                                             In Process
35                            print published August 2013
37                                                3748854
39                                                3633780
40                                             In Process
52                                                3516067
61                                                3813311
69                                                3567269
71                                    published Sept 2013
72                                                3947303
75                                                3708126
78                                                3331923
114                                               3706023
116                                     Not yet available
125                                               3971731
127                                            In Process
148                                               3685266
172                                               3654172
176                                               3847282
187                                               3245760
188                                               3452663
189                                               3510580
198                                               3558749
201                                               3744075
234                                               3442822
237                                               3526451
238                                               3681581
239                                               3708772
244                                               3639180
                              ...                        
1911                                               e-pub 
1912                             in press - due August 13
1929                                               543219
1933                                              3824939
1934                                    Not yet available
1944                                              3281495
1945                                              3651613
1948                                              3635806
1953                                              3236081
1958                                                    -
1974                     Epub ahead of print - April 2013
1975    Epub ahead of print pub Dec 2012, print in pre...
1988                                              3778404
1993                                              3287458
1995                                              3278317
1997                                    Not yet available
2000                                   22897899 MC3489778
2001                                              3703895
2003                                              3223522
2004                                              3813980
2016                                              3635120
2017                                                    -
2018                                              3649934
2037                                    Not yet available
2073                                              3744763
2075                                              3305527
2079                                         pub Aug 2013
2086                                       print in press
2109                                              4035434
2123                                         pub Aug 2013
Name: PMID, Length: 350, dtype: object
df.at[2000,"PMID"]=22897899
df.at[2000,"PMCID"]="PMC3489778"

OK, I think that’s all we can recover for the ids without requerying the database.

df.head()
PMCID Publisher Journal Title Article Title TotalCost PMID
0 PMC3378987 Elsevier Academy of Nutrition and Dietetics Parent support and parent mediated behaviours ... 2379.54 Unknown
1 PMC3780468 ACS (Amercian Chemical Society) Publications ACS Chemical Biology A Novel Allosteric Inhibitor of the Uridine Di... 1294.59 Unknown
2 PMC3621575 ACS (Amercian Chemical Society) Publications ACS Chemical Biology Chemical proteomic analysis reveals the drugab... 1294.78 Unknown
3 Unknown American Chemical Society ACS Chemical Biology Discovery of β2 Adrenergic Receptor Ligands Us... 947.07 Unknown
4 PMC3833349 American Chemical Society ACS Chemical Biology Discovery of an allosteric inhibitor binding s... 1267.76 24015914
df.Publisher[df.Publisher.str.contains("sevier")] = "Elsevier"
/Users/skaplan/Library/Python/2.7/lib/python/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
df.Publisher[df.Publisher.str.contains("iley")].value_counts()
Wiley                                 136
Wiley-Blackwell                        56
Wiley Subscription Services  Inc.      13
Wiley                                  12
Wiley Subscription Services             8
John Wiley & Sons Ltd                   8
John Wiley & Sons                       6
Wiley-VCH                               6
John Wiley                              4
Wiley Blackwell                         4
John Wiley & Sons Inc                   2
Wiley Subscription Services Inc         2
John Wiley and Sons Ltd                 1
Blackwell Publishing Ltd/Wiley          1
Wiley VCH                               1
Wiley Subscription Serviices Inc        1
John Wiley & Sons, Inc.                 1
Wiley-Blackwell, John Wiley & Sons      1
Wiley/Blackwell                         1
Wiley & Son                             1
Wiley Online Library                    1
John Wiley and Sons                     1
Name: Publisher, dtype: int64
df.Publisher[df.Publisher.str.contains("iley")] = "Wiley-Blackwell"
/Users/skaplan/Library/Python/2.7/lib/python/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
df.Publisher.value_counts()
Elsevier                                                            403
Public Library of Science                                           278
Wiley-Blackwell                                                     267
Springer                                                             81
Oxford University Press                                              77
OUP                                                                  56
ASBMB                                                                46
Nature Publishing Group                                              45
BioMed Central                                                       40
BMC                                                                  26
Nature                                                               24
Frontiers                                                            23
BMJ                                                                  23
Royal Society                                                        22
Cambridge University Press                                           20
Company of Biologists                                                18
American Society for Biochemistry and Molecular Biology              18
Dartmouth Journal Services                                           17
Oxford Journals                                                      16
American Society for Microbiology                                    16
PLoS                                                                 15
National Academy of Sciences                                         15
American Chemical Society                                            14
Society for General Microbiology                                     14
Landes Bioscience                                                    14
BMJ Group                                                            13
American Psychological Association                                   13
Sage                                                                 12
Taylor & Francis                                                     11
Society for Neuroscience                                             11
                                                                   ... 
The American Physiological Society                                    1
American Psychiatric Association                                      1
Oxford Univesity Press                                                1
American Speech-Language-Hearing Association                          1
ASBMC /CENVEO                                                         1
Informa Healthcare communications                                     1
Wliey-Blackwell                                                       1
Benthan Science Publishers                                            1
Company of Biologists Ltd                                             1
The American Society for Biochemistry and Molecular Biology, Inc      1
SOCIETY OF NEURO SCIENCES                                             1
Cold Spring Harbor                                                    1
International Union Against tuberculosis and Lung Disease             1
Association for Research in Vision & Ophthalmology                    1
Springer-Verlag GmbH, Heidelberger Platz 3, D-14197 Berlin            1
J Med Internet Research                                               1
American Chemical Society Publications                                1
The company of Biologists                                             1
public.service.co.uk                                                  1
American Society for Microbiology                                     1
Sage Publications Inc                                                 1
BMJ Publishing Group Ltd & British Thoracic Society                   1
Springer                                                              1
IOS Press                                                             1
Impact Journals LLC                                                   1
Sage Publications                                                     1
Wolters Kluwers                                                       1
ASM (American Society for Microbiology)                               1
CADMUS JOURNAL SERVICES                                               1
Hindawi                                                               1
Name: Publisher, Length: 271, dtype: int64
df.Publisher[df.Publisher.str.contains("pringer")].value_counts()
Springer                                                      81
Springer - Verlag GMBH                                         2
Springer-Verlag GmbH                                           2
Springer-Verlag GmbH, Heidelberger Platz 3, D-14197 Berlin     1
Springer Verlag                                                1
Springer                                                       1
Springer Science + Business Media                              1
Springer-Veriag GmbH                                           1
Springer-Verlag GMBH & Ci                                      1
Humana Press (Springer Imprint)                                1
Name: Publisher, dtype: int64
df.Publisher[df.Publisher.str.contains("pringer")]="Springer"
/Users/skaplan/Library/Python/2.7/lib/python/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
df.Publisher[df.Publisher.str.contains("xford")].value_counts()
Oxford University Press          77
Oxford Journals                  16
Oxford University Press (OUP)     4
Oxford Univ Press                 4
Oxford University Press\n         3
Oxford University Press           3
Oxford Univesity Press            1
Oxford Journals (OUP)             1
Name: Publisher, dtype: int64
df.Publisher[df.Publisher.str.contains("xford")]="Oxford University Press"
/Users/skaplan/Library/Python/2.7/lib/python/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
df.Publisher[df.Publisher=="OUP"]="Oxford University Press"
/Users/skaplan/Library/Python/2.7/lib/python/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
df.Publisher[df.Publisher=="ASBMB"]="American Society for Biochemistry and Molecular Biology"
/Users/skaplan/Library/Python/2.7/lib/python/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
df.Publisher[df.Publisher.str.contains("BMJ")] = "BMJ"
/Users/skaplan/Library/Python/2.7/lib/python/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
df.Publisher.value_counts()
Elsevier                                                      403
Public Library of Science                                     278
Wiley-Blackwell                                               267
Oxford University Press                                       165
Springer                                                       92
American Society for Biochemistry and Molecular Biology        64
BMJ                                                            56
Nature Publishing Group                                        45
BioMed Central                                                 40
BMC                                                            26
Nature                                                         24
Frontiers                                                      23
Royal Society                                                  22
Cambridge University Press                                     20
Company of Biologists                                          18
Dartmouth Journal Services                                     17
American Society for Microbiology                              16
PLoS                                                           15
National Academy of Sciences                                   15
American Chemical Society                                      14
Landes Bioscience                                              14
Society for General Microbiology                               14
American Psychological Association                             13
Sage                                                           12
Taylor & Francis                                               11
Society for Neuroscience                                       11
Portland Press Ltd                                             10
Portland Press                                                 10
BioMed Central                                                  9
Biomed Central                                                  8
                                                             ... 
Future Medicine Ltd                                             1
IOS Press                                                       1
Darmouth Journal Services                                       1
The Boulevard                                                   1
Journal of Visualized Experiments                               1
Society for Neurosciences                                       1
The American Physiological Society                              1
American Psychiatric Association                                1
ASBMC /CENVEO                                                   1
Impact Journals LLC                                             1
Transcript Verlag                                               1
The Journal of Visualized Experiments                           1
SOCIETY OF NEURO SCIENCES                                       1
Palgrave MacMillan                                              1
American Soc for Biochemistry and Molecular Biology             1
Society of General Microbiology                                 1
AGA Institute                                                   1
Royal Society for Chemistry                                     1
Cambridge Uni Press                                             1
Informa Healthcare                                              1
International Union Against tuberculosis and Lung Disease       1
Sage Publications Inc                                           1
Association for Research in Vision & Ophthalmology              1
Policy Press                                                    1
THE COMPANY OF BIOLOGISTS                                       1
Federation of the American Society of Experimental Biology      1
American Chemical Society Publications                          1
public.service.co.uk                                            1
American Public Health Association                              1
American Society for Microbiology                               1
Name: Publisher, Length: 244, dtype: int64
df.Publisher[(df.Publisher.str.contains("ACS")) | (df.Publisher.str.contains("Chemical"))] = "American Chemical Society"
/Users/skaplan/Library/Python/2.7/lib/python/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
df.Publisher.value_counts()
Elsevier                                                      403
Public Library of Science                                     278
Wiley-Blackwell                                               267
Oxford University Press                                       165
Springer                                                       92
American Society for Biochemistry and Molecular Biology        64
BMJ                                                            56
Nature Publishing Group                                        45
BioMed Central                                                 40
American Chemical Society                                      33
BMC                                                            26
Nature                                                         24
Frontiers                                                      23
Royal Society                                                  22
Cambridge University Press                                     20
Company of Biologists                                          18
Dartmouth Journal Services                                     17
American Society for Microbiology                              16
PLoS                                                           15
National Academy of Sciences                                   15
Society for General Microbiology                               14
Landes Bioscience                                              14
American Psychological Association                             13
Sage                                                           12
Society for Neuroscience                                       11
Taylor & Francis                                               11
Portland Press Ltd                                             10
Portland Press                                                 10
BioMed Central                                                  9
CUP                                                             8
                                                             ... 
Future Medicine Ltd                                             1
NATURE PUBLISHING GROUP LTD                                     1
The Boulevard                                                   1
Journal of Visualized Experiments                               1
Society for Neurosciences                                       1
The American Physiological Society                              1
American Psychiatric Association                                1
National Academy of Sciences                                    1
ASBMC /CENVEO                                                   1
Informa Healthcare communications                               1
Wliey-Blackwell                                                 1
Cold Spring Harbor Publications                                 1
The royal Society                                               1
CADMUS JOURNAL SERVICES                                         1
public.service.co.uk                                            1
Federation of the American Society of Experimental Biology      1
Policy Press                                                    1
THE COMPANY OF BIOLOGISTS                                       1
University of the Basque Country Press                          1
Cadmus                                                          1
American Public Health Association                              1
American Society for Microbiology                               1
Darmouth Journal Services                                       1
Sage Publications Inc                                           1
Transcript Verlag                                               1
IOS Press                                                       1
Impact Journals LLC                                             1
Sage Publications                                               1
Wolters Kluwers                                                 1
ASBMB/Cadmus                                                    1
Name: Publisher, Length: 239, dtype: int64
df.Publisher[df.Publisher.str.contains("BMC")] = "BioMed Central"
/Users/skaplan/Library/Python/2.7/lib/python/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
df.head()
PMCID Publisher Journal Title Article Title TotalCost PMID
0 PMC3378987 Elsevier Academy of Nutrition and Dietetics Parent support and parent mediated behaviours ... 2379.54 Unknown
1 PMC3780468 American Chemical Society ACS Chemical Biology A Novel Allosteric Inhibitor of the Uridine Di... 1294.59 Unknown
2 PMC3621575 American Chemical Society ACS Chemical Biology Chemical proteomic analysis reveals the drugab... 1294.78 Unknown
3 Unknown American Chemical Society ACS Chemical Biology Discovery of β2 Adrenergic Receptor Ligands Us... 947.07 Unknown
4 PMC3833349 American Chemical Society ACS Chemical Biology Discovery of an allosteric inhibitor binding s... 1267.76 24015914
df["Journal"] = df["Journal Title"]
df.drop("Journal Title",axis=1,inplace=True)
df.head()
PMCID Publisher Article Title TotalCost PMID Journal
0 PMC3378987 Elsevier Parent support and parent mediated behaviours ... 2379.54 Unknown Academy of Nutrition and Dietetics
1 PMC3780468 American Chemical Society A Novel Allosteric Inhibitor of the Uridine Di... 1294.59 Unknown ACS Chemical Biology
2 PMC3621575 American Chemical Society Chemical proteomic analysis reveals the drugab... 1294.78 Unknown ACS Chemical Biology
3 Unknown American Chemical Society Discovery of β2 Adrenergic Receptor Ligands Us... 947.07 Unknown ACS Chemical Biology
4 PMC3833349 American Chemical Society Discovery of an allosteric inhibitor binding s... 1267.76 24015914 ACS Chemical Biology
df.Journal.value_counts()
PLoS One                                           92
PLoS ONE                                           62
Journal of Biological Chemistry                    48
Nucleic Acids Research                             21
Proceedings of the National Academy of Sciences    19
PLoS Neglected Tropical Diseases                   18
Human Molecular Genetics                           18
Nature Communications                              17
PLoS Genetics                                      15
PLoS Pathogens                                     15
Neuroimage                                         15
PLOS ONE                                           14
BMC Public Health                                  14
Brain                                              14
NeuroImage                                         14
Movement Disorders                                 13
Biochemical Journal                                12
Developmental Cell                                 12
Journal of Neuroscience                            12
Journal of General Virology                        11
BMJ                                                10
Current Biology                                    10
PLOS One                                           10
Cell Reports                                        9
Neuron                                              9
Journal of Physiology                               8
Journal of Cell Science                             8
European Journal of Immunology                      8
Molecular Microbiology                              8
Hepatology                                          8
                                                   ..
Public Service Review                               1
Frontiers in Neurorobotics                          1
Frontiers in Decision Neuroscience                  1
Cognitive Therapy and Research                      1
PLoS                                                1
Current Opinion Microbiology                        1
Int Journal for Parasitology                        1
Frontiers in Invertebrate Physiology                1
Synapse                                             1
PNTD                                                1
Journal of Archaeological Science                   1
Journal of Alzheimer Disease                        1
International Immunology                            1
The journal of Biological Chemistry                 1
EvoDevo                                             1
Reproductive Sciences                               1
International Reviews of Immunology                 1
Theranostics                                        1
Journal of Epidemiology & Community Health          1
ASN Neuro                                           1
Clinical Radiology                                  1
Speech Language and Hearing Research                1
PLoS Medicine Journal                               1
Cerebral Cortex print                               1
Tissue Engineering: part A                          1
Emotion                                             1
Arthritis Research and Therapy                      1
The Vet. Journal                                    1
Journal of Transport Geography                      1
American Jnl Epidemiology                           1
Name: Journal, Length: 984, dtype: int64
df.Journal[(df.Journal.str.lower() == "plos") | (df.Journal.str.lower() == "plos 1") | (df.Journal.str.lower() == "plos one")] = "PLoS One"
/Users/skaplan/Library/Python/2.7/lib/python/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
df.head()
PMCID Publisher Article Title TotalCost PMID Journal
0 PMC3378987 Elsevier Parent support and parent mediated behaviours ... 2379.54 Unknown Academy of Nutrition and Dietetics
1 PMC3780468 American Chemical Society A Novel Allosteric Inhibitor of the Uridine Di... 1294.59 Unknown ACS Chemical Biology
2 PMC3621575 American Chemical Society Chemical proteomic analysis reveals the drugab... 1294.78 Unknown ACS Chemical Biology
3 Unknown American Chemical Society Discovery of β2 Adrenergic Receptor Ligands Us... 947.07 Unknown ACS Chemical Biology
4 PMC3833349 American Chemical Society Discovery of an allosteric inhibitor binding s... 1267.76 24015914 ACS Chemical Biology
df = df[["PMID","PMCID","Article Title","Journal","Publisher","TotalCost"]]
df.head()
PMID PMCID Article Title Journal Publisher TotalCost
0 Unknown PMC3378987 Parent support and parent mediated behaviours ... Academy of Nutrition and Dietetics Elsevier 2379.54
1 Unknown PMC3780468 A Novel Allosteric Inhibitor of the Uridine Di... ACS Chemical Biology American Chemical Society 1294.59
2 Unknown PMC3621575 Chemical proteomic analysis reveals the drugab... ACS Chemical Biology American Chemical Society 1294.78
3 Unknown Unknown Discovery of β2 Adrenergic Receptor Ligands Us... ACS Chemical Biology American Chemical Society 947.07
4 24015914 PMC3833349 Discovery of an allosteric inhibitor binding s... ACS Chemical Biology American Chemical Society 1267.76
df.TotalCost.hist(bins=30)
<matplotlib.axes._subplots.AxesSubplot at 0x110fe5350>

png

Whoa, what study cost more than 12000 GBP?!?

df[df.TotalCost>12000]
PMID PMCID Article Title Journal Publisher TotalCost
1341 Unknown Unknown Fungal Disease in Britain and the United State... NaN MacMillan 13200.0
df[df.TotalCost>12000]["Article Title"][1341]
u'Fungal Disease in Britain and the United States 1850-2000'