<p>Laura Chen (laurachen24@gmail.com), Data Scientist in New York City</p>
<h1 id="the-truth-about-secrets">The Truth About Secrets</h1>
<p><em>March 1, 2018</em></p>
<h4 id="uncovering-hidden-themes-in-anonymous-social-media-through-natural-language-processing">Uncovering hidden themes in anonymous social media through Natural Language Processing</h4>
<blockquote>
<p>Nothing makes us so lonely as our secrets. – Paul Tournier</p>
</blockquote>
<p>Secrets can take so many forms: from lighthearted to intense, regrettable to silly, and everything in between. I became interested in studying secrets because I feel you can learn a lot about society from the things that people are afraid to admit in public. By definition, you rarely get to peek behind the curtain, but people become surprisingly honest and vulnerable when you give them the benefit of anonymity. I collected the data for my analysis from <a href="https://whisper.sh">Whisper</a>, an anonymous social media platform for sharing secrets.</p>
<p><img src="/assets/whisper-data.jpg" alt="Data about secrets" /></p>
<p>You can see what people tend to talk about from a high level in this word cloud, but let’s dig a little deeper.</p>
<p><img src="/assets/wc1.png" alt="Word Cloud" /></p>
<h1 id="sentiment-analysis">Sentiment Analysis</h1>
<p>I used VADER Sentiment Analysis to get the sentiment scores for all secrets in my dataset. A higher score means the message is generally more positive, while a lower score means it is more negative.</p>
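<p>To illustrate the idea behind VADER-style scoring, here is a toy sketch of the compound score. The tiny lexicon and word scores below are stand-ins of my own invention; the real VADER library ships a lexicon of thousands of human-rated tokens plus heuristics for punctuation, capitalization, and negation.</p>

```python
import math

# Toy valence lexicon -- illustrative values only, not VADER's real lexicon.
LEXICON = {"love": 3.2, "happy": 2.7, "hate": -2.7, "regret": -2.3, "alone": -1.0}

def compound_score(text):
    """Sum per-word valences, then squash into [-1, 1] the way VADER
    normalizes its compound score (alpha = 15)."""
    total = sum(LEXICON.get(w, 0.0) for w in text.lower().split())
    return total / math.sqrt(total * total + 15)

print(compound_score("i secretly hate my job"))  # negative
print(compound_score("i love being happy"))      # positive
```

<p>Messages with no lexicon words score exactly zero, which is one reason so many secrets land in the neutral band.</p>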
<p>I was surprised to find that the secrets were generally fairly balanced, with most falling on the neutral or slightly negative side. The image below shows the exact breakdown of the scores as well as some examples of messages within those groups.</p>
<p><img src="/assets/secret-sentiment-score.png" alt="Compound sentiment score" /></p>
<h1 id="topic-modeling">Topic Modeling</h1>
<p>I extracted topics from the data using Non-negative Matrix Factorization (NMF). I summarized the top 10 topics across the entire dataset as follows:</p>
<ol>
<li>Parents/family</li>
<li>Sex</li>
<li>Jobs/Money</li>
<li>Exes</li>
<li>Hatred</li>
<li>Marriage</li>
<li>Kids</li>
<li>Pregnancy</li>
<li>High School/College</li>
<li>Dating</li>
</ol>
<p>I used the highest NMF coefficient for each secret to classify it into a topic and found that parents/family, jobs/money, and exes were by far the most common. However, many secrets had extremely low NMF coefficients (less than 0.00001), so I took this interpretation with a grain of salt. I experimented with increasing the number of topics well past the point of interpretability and tried clustering models such as DBSCAN and K-means, but I couldn’t get meaningful results for those noisy points. When I investigated them directly, I found that there were simply a lot of secrets along the lines of “It feels like a sin to wash bacon grease off a pan”.</p>
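<p>The classification step (take the argmax over each secret's NMF topic weights, and treat near-zero rows as uninterpretable noise) can be sketched as follows. The topic names and the 0.00001 threshold come from this post; the example weight vectors are made up:</p>

```python
TOPICS = ["parents/family", "sex", "jobs/money", "exes", "hatred",
          "marriage", "kids", "pregnancy", "high school/college", "dating"]

def assign_topic(weights, threshold=1e-5):
    """Label a secret with its strongest topic, or flag it as noise
    when even the best NMF coefficient is too small to trust."""
    best = max(range(len(weights)), key=lambda i: weights[i])
    if weights[best] < threshold:
        return "noise"
    return TOPICS[best]

# A secret dominated by the jobs/money topic:
print(assign_topic([0.01, 0.0, 0.42, 0.02, 0.0, 0.0, 0.0, 0.0, 0.03, 0.0]))
# A "bacon grease" secret: every coefficient is effectively zero:
print(assign_topic([0.0] * 10))
```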
<p>Interestingly, my sentiment analysis revealed a very different distribution of topics between positive and negative secrets. Understandably, the positive secrets are people gushing about happy moments that they want to share. Secrets with negative tones tend to be about marriage, work, and things people hate.</p>
<p><img src="/assets/secret-topic-dist.png" alt="Distribution of Topics by Sentiment" /></p>
<h1 id="web-app-mockup">Web App Mockup</h1>
<p>There is some comfort in knowing that you’re not alone when you are struggling with a heavy secret. I made a simple mockup of a site where users can type their secret and the model will pull up the most similar secrets by cosine similarity.</p>
<p><img src="/assets/shared-secrets-app.png" alt="shared secrets app" /></p>
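<p>A minimal sketch of the matching logic behind the mockup: vectorize each secret as a bag of words (a real version would use TF-IDF vectors rather than the raw counts shown here) and rank stored secrets by cosine similarity to the query. The sample secrets are invented for illustration:</p>

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def most_similar(query, secrets):
    """Return stored secrets ranked most-similar-first to the query."""
    q = Counter(query.lower().split())
    return sorted(secrets,
                  key=lambda s: cosine(q, Counter(s.lower().split())),
                  reverse=True)

secrets = ["i hate my job so much",
           "i never told my parents the truth",
           "my job pays well but i hate going in"]
print(most_similar("i secretly hate my job", secrets)[0])
```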
<h1 id="conclusion">Conclusion</h1>
<p>There is a lot to learn from secrets, and I would like to continue exploring them from additional angles. One idea is to build a supervised learning model to classify the higher-level themes of secrets (such as insecurities or regrets). I would also like to study the secrets of people from other countries to understand how taboos vary across cultures. Finally, I think it would be fascinating to compare my findings from Whisper to a public social media site such as Instagram or Facebook. A word like “pregnancy”, for example, carries a lot of baggage on an anonymous site like Whisper, yet it is a topic that many people discuss positively, or even boastfully, on Facebook.</p>
<h1 id="supervised-learning-models-and-chocolate">Supervised Learning Models and Chocolate</h1>
<p><em>February 17, 2018</em></p>
<p>The most convenient part about building a supervised learning model for chocolate is that you can stress-eat in the name of “research”. Jokes aside, I focused this project on classifying chocolate bars based on expert reviews from the Manhattan Chocolate Society. Whether you are trying to break into the chocolate industry, or you’re just a connoisseur looking for the next delicious bar to eat, hopefully these findings will be of interest.</p>
<p>The Manhattan Chocolate Society maintains <a href="http://flavorsofcacao.com/chocolate_database.html">a database</a> with information about more than 1,800 chocolate bars as of February 2018. My goal with this project was to determine which factors lead to a highly rated bar of chocolate. The official ratings provided in the database run from 1 to 5, but I grouped the data into Low, Medium, and High tiers to even out the classes.</p>
<p><img src="/assets/choco-pie.jpg" alt="choco-pie" /></p>
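<p>The tier grouping is a simple binning step. The cutoff values below are my own illustrative guesses at plausible boundaries, not the exact ones used; only the 1-to-5 scale comes from the database:</p>

```python
def rating_tier(rating, low_cut=2.75, high_cut=3.5):
    """Bucket a 1-5 expert rating into Low / Medium / High tiers.
    The cutoffs are illustrative, chosen to even out the classes."""
    if rating < low_cut:
        return "Low"
    if rating < high_cut:
        return "Medium"
    return "High"

print([rating_tier(r) for r in (2.0, 3.0, 4.0)])  # ['Low', 'Medium', 'High']
```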
<h3 id="which-factors-mattered">Which factors mattered?</h3>
<p>The most important predictor of high quality chocolate ended up being whether the bar was crafted from Criollo beans. Upon further research, I found that this bean is considered one of the most prized in the world for its rarity and flavor. Due to the rarity, I would also infer that chocolatiers who use this bean are probably more likely to meticulously craft their recipes to perfection. They are definitely not using this in your average Hershey’s bar.</p>
<p>I found it interesting that the location of the company processing the chocolate actually had a much bigger impact on how the chocolate was rated than the region where the beans were sourced. I created a D3 visualization to show the average chocolate bar rating for different countries around the world; darker colors represent a higher rating. Latin America and parts of southern Europe produce many of the best bars in the world.</p>
<p><img src="/assets/map-of-chocolate.jpg" alt="Map of Chocolate Production" /><br />
<a href="https://bl.ocks.org/LauraChen/raw/e6cf35cd59ed0a756467d9d35a7f0682/">Click for the full interactive version</a></p>
<p>Why does location of production seem to matter more than bean origin? I interpret this two ways:</p>
<ol>
<li>Cocoa beans require very particular growing conditions. Thus, many different companies end up sourcing their beans from the same few locations. Within the sample, the vast majority of beans were sourced from Latin America and the Caribbean (about 70%). As a result, vastly different bars are showing up as from the same origin and the effect of bean origin is minimized.</li>
<li>Local processing techniques, recipes, and culture/tastes can have a large impact on the final product.</li>
</ol>
<p>Between the two, I believe #1 is likely the more significant reason. I would love to see additional data to further investigate these ideas.</p>
<h3 id="technical-details">Technical Details</h3>
<p>The tiers of chocolate ratings were quite imbalanced at the start, so I had to upsample my training set to increase precision of the model. I experimented with several models such as Logistic Regression, Naive Bayes, and Support Vector Classification, but my ultimate winner was the Random Forest model. As I further tuned the model, I prioritized precision over accuracy and recall because I’d rather let some highly rated chocolates slip through the cracks than accidentally predict that Low rated chocolates are High.</p>
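<p>The upsampling step can be sketched in a few lines: randomly resample each minority class with replacement until every class matches the majority count, applied to the training split only. A real pipeline might use scikit-learn's <code>resample</code> utility or the imbalanced-learn package instead; this hand-rolled version just shows the idea:</p>

```python
import random
from collections import Counter

def upsample(rows, labels, seed=42):
    """Randomly oversample minority classes (with replacement) so every
    class is as large as the majority class. Training data only!"""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == cls]
        out_rows += rng.choices(pool, k=target - n)
        out_labels += [cls] * (target - n)
    return out_rows, out_labels

rows = ["bar1", "bar2", "bar3", "bar4", "bar5"]
labels = ["Medium", "Medium", "Medium", "High", "Low"]
_, new_labels = upsample(rows, labels)
print(Counter(new_labels))  # every class now has 3 examples
```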
<p>In the end, my final model had about 48% precision and 41% recall for predicting High ratings for chocolates. While the performance was not the absolute best, I am fairly pleased with it considering how few features I was able to include.</p>
<h3 id="areas-of-further-exploration">Areas of further exploration</h3>
<p>I think there are a ton of factors that go into a chocolate bar’s overall experience. It would be interesting to see more data on other qualities such as acidity, bitterness, texture, etc. I would like to see additional data on percentages of other ingredients such as sugar, vanilla, and emulsifiers. It would also be great to have details on processing techniques and agricultural techniques.</p>
<p>The ability for a chocolate bar to receive a High rating from a panel of experts is one thing, but what about flavor ratings from real customers? It would be interesting to see how flavor preferences could be used to recommend chocolates that a specific type of person would enjoy. I think it would also be interesting to investigate sales numbers of different chocolate bars to see which ones are the most profitable.</p>
<h1 id="web-scraping-for-the-greater-good">Web Scraping for the Greater Good</h1>
<p><em>January 28, 2018</em></p>
<p>One of the coolest parts of studying data science is applying it to real-life questions. One thing I’ve thought about a lot, through my experiences with Net Impact at NYU and during consulting projects at nonprofits, is how to run charities efficiently. When I found out our second project was on web scraping and linear regression, I decided to do an analysis of charities.</p>
<p><img src="https://media.giphy.com/media/I6rM4juHFpL3y/giphy.gif" alt="One of the best scenes in the movie" /></p>
<p>I pulled data from the <a href="https://www.charitynavigator.org/">Charity Navigator website</a> and created a model to predict what percentage of a charity’s total expenses actually go towards charitable programs versus administrative and fundraising expenses. I do recognize that this metric is not entirely fair: because it is only a percentage, it says nothing about a charity’s absolute impact, and I would much prefer to measure the total benefit provided, such as the number of adoptions facilitated or vaccinations administered. Nonetheless, this is the metric that was available, and I think it serves the purpose of this analysis.</p>
<p>The model is far from perfect, with an R-squared of 0.57, but it was interesting to see which factors mattered most. Unsurprisingly, I found that organizations that have transparent reporting policies and accept government grants tend to allocate more of their budget to programs. I saw a negative relationship between my dependent variable and fundraising expenses, which also makes a lot of sense. I also saw that charities that charge fees for their services spend less on their charitable programs. My hypothesis is that these organizations spend more on overhead expenses such as real estate and equipment.</p>
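<p>For reference, R-squared measures the share of variance in the target (here, the program expense percentage) that the model explains. A minimal computation, using made-up actual and predicted values:</p>

```python
def r_squared(y_true, y_pred):
    """1 - SS_residual / SS_total: the fraction of variance explained."""
    mean = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return 1 - ss_res / ss_tot

# Hypothetical program-expense percentages: actual vs. predicted
actual = [0.80, 0.65, 0.90, 0.70]
predicted = [0.78, 0.70, 0.85, 0.72]
print(round(r_squared(actual, predicted), 3))
```

<p>A perfect model scores 1.0; a model no better than predicting the mean scores 0.</p>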
<p>As I mentioned, the percentage of a charity’s budget that goes to program expenses is just one metric, and it does not give the full picture. It would be great to have more data to incorporate into the model, such as the size of the charity or whether it operates on a national or local level.</p>
<h1 id="my-first-week-at-metis">My first week at Metis</h1>
<p><em>January 14, 2018</em></p>
<p>Leaving my job as an implementation consultant at Deloitte Digital to join Metis has been such a whirlwind, but in the best way possible.</p>
<p>My typical day, in contrast:</p>
<blockquote>
<p>At Work:<br />
<strong>Me:</strong> Hey dude, can you make the system do this? While you’re at it, can you make this button blue and move it to the left side of the screen?<br />
<strong>Developer on my team:</strong> Done. Coffee time?</p>
</blockquote>
<blockquote>
<p>At Metis:<br />
<strong>Me:</strong> This would be a really cool algorithm to design! I wonder if I can build it to do what I want without it taking 392,749,327,538 years to run!<br />
<strong>Also me:</strong> #($&(!@&#@!!!!</p>
</blockquote>
<p>Currently, the ideas in my head exceed my technical ability to accomplish them. This can be frustrating at times, but it is the main reason I came to Metis: to bridge that gap.</p>
<p><img src="https://media.giphy.com/media/26xBI73gWquCBBCDe/giphy.gif" alt="My face half the time in class" /></p>
<p>One thing I love about data science is the creativity of working on open-ended questions. This week, we tackled our very first project: recommending the best MTA stations for a nonprofit street team to conduct marketing activities. There were about a million things our group wanted to investigate, and it felt like our biggest challenge was to rein ourselves in and select a manageable scope of analysis given the one-week timeframe, the messy nature of the data, and our own learning curves.</p>
<p>That being said, I’m extremely proud of our team’s work. I’ve learned so much already and am extremely motivated to push myself further.</p>