My Blog

What data does LinkedIn collect? – Part I

I am more of a passive user on LinkedIn. I hardly like, share or post anything there. However, just out of curiosity, I downloaded the data which LinkedIn stores about me and decided to analyze it. Getting your data is quite easy: go to your LinkedIn account, open Settings, select Privacy, and go to "Download your data". It will ask you to enter your LinkedIn password for security reasons.


After that it notifies you that it will send an email once the data is ready to download. Within 10-15 minutes I received an email from LinkedIn saying that Part 1 of my data was ready for download, and that the second part would be ready within 24 hours. The email redirects you to your LinkedIn profile, from where you can download the zip file.

After unzipping the file, this is how the folder structure looks:


1) Videos.csv contained just one line, https://www.linkedin.com/psettings/member-data/videos, and when I went to that link it said “You haven’t uploaded any videos”, which is true. I haven’t uploaded any videos on LinkedIn.

2) Skills.csv contained all the skills which I had mentioned on my profile. Some of them were R, Data Analysis, Statistics, Data Science etc. These are the skills which other people endorse us for (even if we don’t have them 😛 ). I actually expected them to also store the count of how many people have endorsed each skill, because that is an important number to keep. Anyway, if they show it on the profile they must definitely be storing it somewhere.

3) Registration.csv had the details of when I registered on the website, I suppose. I don’t clearly recall the date and time I signed up for LinkedIn, but I am assuming this is correct. The other columns were blank.


From the IP address, I checked what details I could find out. A basic search reveals these details from the IP.


So, I was at home when I signed up for LinkedIn.

4) Projects.csv includes the projects which you have added on LinkedIn, along with their descriptions, URLs (if any), and start and end dates.


5) Profile.csv maintains details of your profile which you have shared. Your name, address, birth date etc.


6) Positions.csv keeps a record of all your employment details. The organizations you have worked with, your title there, the duration for which you worked etc.


7) Phone Numbers.csv contained only my phone number in it.

8) Messages.csv had all the conversations/messages I have had over LinkedIn, both sent and received. One thing worth noticing: it had 1074 messages in total, and the oldest message was from November 2014. It is hard to believe that I did not receive any messages between signing up in 2011 and 2014. Or do they only show the most recent 1000 messages or so?


9) Languages.csv contains languages and their proficiency.


10) Invitations.csv contains information about all the invitations sent and received by you: the time the invitation was sent and any invitation message sent along with it. This too had around 2k rows, which is way fewer than my total connections, and the oldest invitation was from 2017, so I believe even this has some filter on it, like Messages.csv.


11) Imported Contacts.csv has all the contacts which you have imported from your personal email address: first and last name of the imported contact, their email address, when the contact was imported, and their phone number (if any).

12) Email Addresses.csv includes your email addresses. I had two, one primary and the other secondary. It also has a flag for whether each email address is verified or not.

13) Education.csv, like Positions.csv, has details of whatever education you have uploaded.


14) Courses.csv includes all the courses you have taken which you have included on the platform.

15) Connections.csv Now this is, I think, the most important csv of all. It has a list of all your connections along with their email addresses, the company they work for, their position, and the time when you connected with them. One thing we need to keep in mind is that when we connect with anybody, we are giving away our email address to them.

16) Certifications.csv has the certificates which you have included.

17) Cause you Care About.csv This too is straightforward info which you have shared.

Media Files – has the media files that you have shared on the platform: any images/documents that were uploaded.

Jobs – This gets divided into two parts. One csv is for our job preferences (Job Seeker Preference.csv), which even I don’t remember setting up. It says I am looking for a job casually and ready to join in 4 to 6 months. The other csv, Job Applications.csv, has the details of all the jobs you have applied to till now: the time when you applied, the title for which you applied, the company name, and the name of the resume you uploaded.

l


What am I working on recently?

I know I haven’t written a blog in a long time. I haven’t solved any Euler problem since last time, and the thing I am working on is a long, never-ending process. I usually write a blog once I finish something, but as this is a continuous process I didn’t bother to write about it. However, I realized I should stop and just give an update on what’s going on and what I am working on.

I am building my web page. I know sharonak47.wordpress.com is already a web page and I could do many things with WordPress, but it has been long since I edited or modified anything on WordPress, so if I had to continue using it I would have to learn WordPress again and reinvent the wheel. I don’t mind doing that, but I thought learning and using WordPress would not be that effective for me in the long run; I don’t want to be a WordPress developer in the near future. How would it be if I could do this in my favorite language? Yes, you guessed it right: I am building a website in R 😀 ;-). More or less it will just be a single-page website. Shiny has become quite powerful and has grown a lot in the last couple of years; there are many amazing things one can do with it, although the web page I am building is pretty basic and won’t be using all the powers of Shiny. The idea behind this website is to be a central place for all my web presence, so anybody who reaches my website will be able to find me anywhere (only online though :P).

You can find the website at https://shahronak.shinyapps.io/my_shiny_app/ (I know I need a better URL, but this is what shinyapps.io provides 😛 ). It is still in the development phase, and you will see some changes over the next few weeks. Next came deciding what I should include in the website. I was pretty sure I wanted to include my Stack Overflow answers (yes, I am pretty proud of them). There are quite a few packages in R to play with the Stack Overflow API, but most of them are half-baked and old. I found the stackr package from David Robinson, which worked quite well. I provide my user ID (3962914) and get the recent answer IDs. Using those answer IDs, I generate the URLs for my answers by appending them to https://stackoverflow.com/a/. For display purposes I cannot show the entire answer body, so I decided to display the question title and anchor it with the answer URL. We get the question IDs from the same stack_users call as above, and the titles of those questions from the stack_questions function. We display only the 6 most recent answers. That integrated pretty well: it updates as soon as I post any answer on Stack Overflow and doesn’t need any manual checking.
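The answer-link construction described above can be sketched as follows (Python for illustration; the site itself uses the stackr package in R, and the answer IDs and titles below are made-up placeholders):

```python
# Sketch of building (title, url) pairs for recent Stack Overflow answers.
ANSWER_BASE = "https://stackoverflow.com/a/"

def answer_links(answers, limit=6):
    """Build (title, url) pairs for the most recent `limit` answers.

    `answers` is a list of dicts with 'answer_id' and 'title' keys,
    assumed to be sorted newest first (as the API returns them).
    """
    return [(a["title"], ANSWER_BASE + str(a["answer_id"]))
            for a in answers[:limit]]

if __name__ == "__main__":
    fake = [{"answer_id": 123456, "title": "How to reshape a data frame?"},
            {"answer_id": 654321, "title": "Vectorize a loop in R"}]
    for title, url in answer_links(fake):
        print(title, "->", url)
```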

I was also planning to display my top answers from Quora similarly. Unfortunately, there are no packages for Quora to do the same in R. Moreover, Quora does not even have official API support, which makes it more difficult. There are a few packages in Python which allow you to query Quora, but nothing in R that I could find. So this is postponed for now, as there is no straightforward approach. I added a Twitter widget which shows my recent tweets in the sidebar; this is the same one I used on the WordPress site, so it was pretty simple. Apart from all these, I have added links to my profiles on various other platforms like Facebook, LinkedIn, GitHub, Quora and my WordPress blog. I also wanted to show the titles of my recent blog posts from WordPress, but even that is not straightforward. Further, I plan to include my side projects somewhere, like the couple of Twitter bots I have developed and the bsetools and bsedata projects.

I gave it a resume kind of look by including my employment and educational history. Working on the client side was more a matter of HTML and CSS; I have some basic skills there which I am using here. Let’s see how far I can take this.

Euler Problem 58 – Spiral Primes

Spiral primes

Problem 58

Starting with 1 and spiralling anticlockwise in the following way, a square spiral with side length 7 is formed.

37 36 35 34 33 32 31
38 17 16 15 14 13 30
39 18  5  4  3 12 29
40 19  6  1  2 11 28
41 20  7  8  9 10 27
42 21 22 23 24 25 26
43 44 45 46 47 48 49

It is interesting to note that the odd squares lie along the bottom right diagonal, but what is more interesting is that 8 out of the 13 numbers lying along both diagonals are prime; that is, a ratio of 8/13 ≈ 62%.

If one complete new layer is wrapped around the spiral above, a square spiral with side length 9 will be formed. If this process is continued, what is the side length of the square spiral for which the ratio of primes along both diagonals first falls below 10%?

At first I didn’t get how the spiral was formed; I was confused about how the numbers were generated. After reading it a second time, I remembered we had dealt with something similar previously. The blog for that is here and the code can be found here. With the help of that code, the main work of getting the spiral matrix going was done, generating the sequence of four diagonal numbers we are interested in. The remaining work was to find out how many of those 4 numbers are actually prime and calculate the ratio of the total number of primes on the diagonals to the total number of diagonal elements.

So, we have two variables: previous_prime_count, which keeps track of the number of primes on the diagonals, and previous_length, which keeps track of the total number of diagonal elements. At each iteration we add 4 elements to the diagonals. The is_prime function checks how many of those four elements are prime and adds that to the previous_prime_count variable. We keep generating new layers using the formula until the ratio of previous_prime_count to previous_length falls below 0.1.

This was a nice and simple program; however, I immediately started running into issues as soon as I ran it. The initial 200 iterations were covered easily, but it became extremely slow when the ratio was around 0.2. I thought it was almost done and that going from 0.2 to 0.1 would hardly take any time, but I was so wrong: it took 3 days. 3 long days……yeah. I know it is crazy, but it did take that much time. Initially I didn’t even see where the problem was, as everything seemed to be done in constant time. However, I later realized that as the numbers grew, more and more time was spent in the is_prime function. It turns out I had not written the most efficient is_prime: ideally, the loop should go from 2 to sqrt(n) to check whether a number is prime, but I had been checking up to n/2, which is a very big performance loss.
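The fix described above is to bound trial division at sqrt(n) instead of n/2. A minimal sketch of the faster check (Python for illustration; the actual solution uses an R is_prime function sourced from an earlier problem):

```python
import math

def is_prime(n):
    """Return True if n is prime, using trial division only up to sqrt(n)."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2  # 2 is the only even prime
    for d in range(3, math.isqrt(n) + 1, 2):
        if n % d == 0:
            return False
    return True
```

Checking up to sqrt(n) works because any composite n has a factor no larger than its square root, so the loop shrinks from ~n/2 iterations to ~sqrt(n).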

source("/Users/Ronak Shah/Google Drive/Git-Project-Euler/1-10/3.Check_if_Prime.R")

flag = TRUE
i = 0
previous_prime_count = 0  # primes seen on the diagonals so far
previous_length = 1       # diagonal elements so far (the centre counts as 1)
while (flag) {
  i = i + 1
  constant = 2 * i
  length_i = constant + 1                  # side length of the current layer
  second_number = 4 * (i^2) + 1
  first_number = second_number - constant
  third_number = second_number + constant
  fourth_number = third_number + constant  # the four new corner values
  previous_prime_count = previous_prime_count +
    sum(is_prime(c(first_number, second_number, third_number, fourth_number)))
  previous_length = previous_length + 4
  ratio = previous_prime_count / previous_length
  cat(ratio, length_i, "\n")  # progress output
  if (ratio < 0.1)
    flag = FALSE
}

length_i
#[1] 26241

 

Euler Problem 57 – Square Root Convergents

Square root convergents

Problem 57

It is possible to show that the square root of two can be expressed as an infinite continued fraction.

√ 2 = 1 + 1/(2 + 1/(2 + 1/(2 + … ))) = 1.414213…

By expanding this for the first four iterations, we get:

1 + 1/2 = 3/2 = 1.5
1 + 1/(2 + 1/2) = 7/5 = 1.4
1 + 1/(2 + 1/(2 + 1/2)) = 17/12 = 1.41666…
1 + 1/(2 + 1/(2 + 1/(2 + 1/2))) = 41/29 = 1.41379…

The next three expansions are 99/70, 239/169, and 577/408, but the eighth expansion, 1393/985, is the first example where the number of digits in the numerator exceeds the number of digits in the denominator.

In the first one-thousand expansions, how many fractions contain a numerator with more digits than denominator?

At first, this looked like a complicated question. I was scared of large decimal numbers. After thinking for a while, I noticed that at every step we are adding a constant 1/2 term to the previous expansion, so the sequence generated should be deterministic. I kept the numerator and denominator separately; no need to keep them as a fraction and complicate things. Concentrating on the numerator first, the sequence is 3, 7, 17, 41, 99, 239, … I tried to create a formula which generates this sequence. I was unsuccessful, but then I checked the OEIS website to understand the sequence, where I found sequence A001333, which gives the formula for generating the numerators:

A(n) = 2 * A(n-1) + A(n-2)

Upon testing, this clearly satisfied the numbers shown. When I checked the sequence of denominators, it gave a different sequence but with the same formula. It was surprising that both numerators and denominators were generated using the same recurrence. As everything was deterministic, I thought even the indexes where the number of digits in the numerator exceeds that of the denominator should be deterministic as well. I printed out all such indexes:

8, 13, 21, 26, 34, 39, 47, 55, 60, 68, 73

were the indexes. There is a page on this sequence as well. Unfortunately, there is no closed-form formula to generate it. If there were, we could have generated the sequence up to 1000, counted the number of elements in it, and that would have been the answer.

However, as no such formula is available, we use the traditional approach and count the expansions where the numerator has more digits than the denominator. We need two initial numbers to generate the next one, which we can get from the example itself. For the numerator we use 3 and 7, whereas for the denominator we use 2 and 5. We generate the next number using the formula.
For the numerator:

A(n) = 2 * A(n-1) + A(n-2)
A(n) = 2 * 7 + 3
A(n) = 17

And for the denominator:

A(n) = 2 * 5 + 2
A(n) = 12

We keep repeating this procedure for 1000 iterations. Towards the end it started giving the answer as Inf, which meant base R was incapable of handling such large values. To overcome that, we use the gmp library, which has a function called mul.bigz to multiply large numbers. The rest is simple R code.

library(gmp)
num_1 = 3
num_2 = 7
denom_1 = 2
denom_2 = 5
count = 2
total_num = 0

while(count < 1000) {   
  temp = num_1
  num_1 = num_2   
  num_2 = mul.bigz(as.bigz(2), as.bigz(num_2)) + temp
  temp = denom_1 
  denom_1 = denom_2  
  denom_2 = mul.bigz(as.bigz(2), as.bigz(denom_2)) + temp
  count = count + 1
  if (nchar(as.character(num_2)) > nchar(as.character(denom_2)))
    total_num = total_num + 1
}
total_num
#[1] 153

system.time()
#user  system elapsed 
#0       0       0 

 

Euler problem 56 – Powerful digit sum

Powerful digit sum

Problem 56

A googol (10^100) is a massive number: one followed by one-hundred zeros; 100^100 is almost unimaginably large: one followed by two-hundred zeros. Despite their size, the sum of the digits in each number is only 1.

Considering natural numbers of the form a^b, where a, b < 100, what is the maximum digital sum?

This was another straightforward problem which took me just 15-20 minutes to solve. Like the last problem, this one has to go through two loops, each going from 1 to 99: raise a to the power b, sum the digits, and keep the maximum. The logic is simple; however, there are two things which need to be taken care of. First, you need to set options(digits = 22) and options(scipen = 999), because as the numbers increase they are shown in scientific notation, which later creates problems while calculating the sum of digits. One cannot sum the digits of 1.5e10 and such numbers.

Another small adjustment: you cannot compute all the powers in base R itself. As the numbers increase, you get wrong answers. There has always been a problem with higher-precision digits, as we have experienced in previous problems, so we use the pow.bigz function from the gmp library to calculate the higher powers. We had already written a function to calculate the sum of digits, so I reused it.

library(gmp)
options(digits = 22)
options(scipen = 999)

max_digit_sum = 0
for (a in seq(99)) {
  for (b in seq(99)) {
    cat(a, b, "\n")  # progress output
    # exact big-integer power, split into digits and summed
    digit_sum = sum(as.numeric(unlist(strsplit(as.character(pow.bigz(a, b)), ""))))
    if (digit_sum > max_digit_sum) {
      max_a = a
      max_b = b
      max_digit_sum = digit_sum
    }
  }
}
max_a
#[1] 99

max_b
#[1] 95

max_digit_sum
#[1] 972

system.time()
#user  system elapsed
#0.19   0.00   0.20

 

Euler Problem 55 – Lychrel numbers

The problem statement is quite big. You can view it here.

The question is straightforward, especially if you follow the brute force method. Basically we have two loops: the outer loop takes values from 1 to 10000, while the inner loop runs for at most 50 iterations.

For every number, we add it to its reversed number and check if the sum is a palindrome. We continue reversing and adding until we find a palindrome. We check this for only 50 iterations, and any number which exceeds this 50-iteration counter we consider a Lychrel number and add to the list. Simple, right?

However, there was one point where I got stuck for a couple of minutes. I was checking if a number was a palindrome, then adding its reversed number to it, and then checking again. However, we first need to add and then check; I was doing it the opposite way. It is clearly mentioned in the question,

there are palindromic numbers that are themselves Lychrel numbers; the first example is 4994.

So if I check whether 4994 is a palindrome in the first step itself, it will satisfy the palindrome requirement and not be considered a Lychrel number, which is obviously wrong. We first need to add the number to its reversed number, and only then check if the sum is a palindrome.

library(stringi)

lychrel_nums_under_n <- function(limit) {
  lychrel_num = numeric()  # kept inside the function instead of as a global
  for (i in seq(limit)) {
    # add the number to its reverse first, and only then start checking
    num = i + as.numeric(stri_reverse(i))
    count = 0
    while (num != stri_reverse(num) & count < 50) {
      num = num + as.numeric(stri_reverse(num))
      count = count + 1
    }
    if (count == 50)
      lychrel_num = c(lychrel_num, i)
  }
  return(lychrel_num)
}

I could have written a function to check if a number is a palindrome. However, there is already a function in the stringi package called stri_reverse which reverses a string, so I used that.

length(lychrel_nums_under_n(10000))
#[1] 249

system.time(lychrel_nums_under_n(10000))
#user  system elapsed 
#0.75    0.00    0.75 

 

Using bsetools to get live BSE data and send email

So we did publish a package on PyPI which fetches BSE share prices. However, although the package was already released, it was not yet being used for the purpose it was created for in the first place: I was still getting data from NSE, which should now be changed to BSE as the package was up and running.

The code is more or less the same as what we had for nsedata; the only difference is that we now use the bsetools package instead of nsetools. After installing bsetools with

pip install bsetools

We read the csv where the details of all the quotes are stored. We use the get_quote function for every quote to get its respective price from the BSE website. We then create a data frame with the required columns, and use the email_main function, which is the same one we have used previously.
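The pipeline above can be sketched roughly as follows. Here `fetch_quote` stands in for bsetools' get_quote and the mailing step is omitted; the CSV column name and the quote dict shape are assumptions for illustration, not the real APIs:

```python
import csv
import io

def build_price_rows(csv_text, fetch_quote):
    """Read quote symbols from CSV text and attach their current prices.

    `fetch_quote` is any callable taking a symbol and returning a dict
    with a 'price' key (a stand-in for a real quote-fetching function).
    """
    rows = []
    for rec in csv.DictReader(io.StringIO(csv_text)):
        quote = fetch_quote(rec["symbol"])
        rows.append({"symbol": rec["symbol"], "price": quote["price"]})
    return rows

if __name__ == "__main__":
    sample = "symbol\ninfy\ntechm\n"
    fake_quotes = {"infy": {"price": 1450.0}, "techm": {"price": 710.5}}
    print(build_price_rows(sample, lambda s: fake_quotes[s]))
```

Injecting the fetcher as a parameter also makes the CSV-to-table step testable without hitting the BSE website.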

Euler Problem 54 – Poker Hands

Poker hands

Problem 54

For a change I am not copying the problem statement here because it is way too long; you can have a look at Project Euler Problem 54. I solved a Project Euler problem after a long time, for a couple of reasons. First, I was busy with other things: I recently released a Python package, bsetools, which gets share prices from the BSE website, so most of my time went there. Then I finally shifted focus to solving this problem, and guess what? It was so difficult to solve. There were so many cases, edge cases, which needed to be handled; so many possibilities to cover. On top of that, I am not a poker player, so this game was quite new to me. I had some idea about it, but I had not played it enough to understand the small cases. I read it completely a couple of times and talked with some colleagues in the office who play poker regularly to get answers to some questions.

So after all those discussions and thinking, I thought of making separate functions for all the cases mentioned in the question: a separate function which checks if a hand is a royal flush, a separate function for a straight flush, four of a kind, and so on. Initially I thought I would give ranks/points to these categories and then compare them against each other to decide which player has the better-ranked hand. For example, for a royal flush we can give 9 points, a straight flush 8 points, four of a kind 7 points and so on. So if player 1 has a straight he gets 4 points, and if player 2 has a full house he gets 6 points; in this case player 2 wins as he has the higher points. I thought this was a good approach until more complications kicked in. What if there is a tie? If both player 1 and player 2 have a pair, then we need to check which player has the higher-valued pair; if it is still a tie, then we need to check which player holds the higher-valued card. And this is just one complication; there are many others for the other conditions as well.
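The point-based ranking idea above can be sketched like this (Python for illustration; the actual solution is in R, and the point values follow the post, royal flush = 9 down to high card = 0, with the hand classification itself assumed done elsewhere):

```python
# Category points for comparing two already-classified poker hands.
RANK_POINTS = {
    "royal flush": 9, "straight flush": 8, "four of a kind": 7,
    "full house": 6, "flush": 5, "straight": 4,
    "three of a kind": 3, "two pairs": 2, "one pair": 1, "high card": 0,
}

def compare_hands(cat1, cat2):
    """Return the winner by category points, or 'tie' when the categories
    match (to be resolved by a break_ties-style helper on card values)."""
    p1, p2 = RANK_POINTS[cat1], RANK_POINTS[cat2]
    if p1 > p2:
        return "Player 1"
    if p2 > p1:
        return "Player 2"
    return "tie"
```

The "tie" branch is exactly where the complications described above begin: equal categories must be broken by comparing pair values and then high cards.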

The functions which I had written just returned TRUE/FALSE and not the values: is_a_pair only returned TRUE/FALSE based on whether the hand had a pair, not which value that pair contains. To handle the above cases we needed another function, named break_ties, which breaks the ties and returns the higher-valued pairs. Each function returns “Player 1” or “Player 2” according to the higher-ranked cards. We then calculate the number of hands “Player 1” has won.

I divided this program into 3 separate files. Poker_card_supporting_functions.R has all the base functions like is_a_pair, is_straight, is_flush etc. Get_hand.R has the function get_hand, which calls all the functions present in the supporting functions file; it receives a hand and checks one by one if it is a pair, a flush, a straight etc., and handles various cases as well. If there is a tie, the function break_ties is called. Poker_Hands.R is the outer script which reads the poker txt file, divides each row into separate hands for player 1 and player 2 (the first five columns are player 1’s hand, the last 5 are player 2’s), and then sends the two hands to the get_hand function. It stores the returned output (Player 1 or Player 2) in the output variable, and we calculate the frequency of each occurrence.

Looking back, I think the code can definitely be simplified. There are a lot of unnecessary function calls which could be reduced and made more readable, but by then I had already put too much time into this, was exhausted, and just wanted to get it over with. I was happy that I at least reached the answer.

source("/Users/Ronak Shah/Google Drive/Git-Project-Euler/51-60/54.Get_hand.R")

df <- read.table("/Users/Ronak Shah/Downloads/p054_poker.txt", sep = "", header = F, stringsAsFactors = FALSE)
output <- character(nrow(df))

for (i in 1:nrow(df)) {
  hands_1 <- as.character(df[i, 1:5, drop = TRUE])
  hands_2 <- as.character(df[i, 6:10, drop = TRUE])
  output[i] <- get_hand(hands_1, hands_2)
}

table(output)
#output
#Player 1 Player 2 
#376      624 

#user  system elapsed 
#2.32    0.00    2.35 

The complete code can be found here.

bsetools – Get BSE share prices

In the last article I showed how to get prices from BSE’s website. As there is no package/library for BSE prices, I thought of creating a package and uploading it to PyPI. I decided on the name bsetools as it sounds similar to nsetools: having found either of the packages, a user can easily find the other one if they wish to use it. I had no idea how to upload packages to PyPI, as I had not done it before. After googling a bit, I found two posts which helped me a lot in uploading a package to PyPI:

  1. Peter Downs’ blog, and
  2. the PyCharm blog

Apart from the code files, I uploaded the setup.py and LICENSE.txt files. LICENSE.txt contains the MIT License, which sets the conditions to share, distribute and reuse this code. setup.py contains all the information about the package. The install_requires part is used to list all the package dependencies which need to be installed to use this package; if the person installing the package does not have these packages, they automatically get downloaded to their system so the package can be used without anything blocking.
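A setup.py along the lines described might look like the sketch below. The version, description and dependency names here are illustrative placeholders, not the actual contents of the bsetools file:

```python
from setuptools import setup

setup(
    name="bsetools",
    version="0.1.0",              # placeholder version
    description="Fetch BSE share prices",
    license="MIT",
    packages=["bsetools"],
    # assumed dependencies for illustration; install_requires pulls
    # these in automatically when someone pip-installs the package
    install_requires=["requests", "beautifulsoup4"],
)
```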

Once the github repo is clean and ready for release of the package, we need to tag it.

git tag tag_name

Where tag_name is usually the version number which you are uploading.
Then do

git push --tags origin master

After the tag is pushed, we can upload the package to PyPI directly by doing:

python setup.py sdist upload -r pypi

There are a few changes which I included in this release.

  • I added support for python 3.5+
  • If a share which does not have a BSE page is requested, it returns an error message.

The link to the package on PyPI is here: PyPI bsetools, and the GitHub repo is at
Github bsetools

So if you want to install and start using bsetools package you just need to do

 pip install bsetools

 

Getting BSE data

I had recently written a blog on getting NSE data and sending an email with the prices. That used a library called nsetools to get the prices. I actually needed prices of shares from the BSE website, but there is no package called bsetools or anything similar, so I decided to try to do it myself. I checked the source code of the nsetools package, and basically they scrape the prices from nseindia.com, the official website for NSE. If you check the prices on nseindia, you will find that https://www.nseindia.com/live_market/dynaContent/live_watch/get_quote/GetQuote.jsp?symbol= is the constant part of the link, and the part after “=” is based on the share whose price we want. Therefore, it is necessary to get the symbol of the share we want prices for. So for Infosys the link becomes https://www.nseindia.com/live_market/dynaContent/live_watch/get_quote/GetQuote.jsp?symbol=INFY, and so on for all other shares.
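The NSE URL construction just described is a simple concatenation; a minimal sketch:

```python
# Constant part of NSE's quote URL; the share symbol is appended after "=".
NSE_QUOTE = ("https://www.nseindia.com/live_market/dynaContent/"
             "live_watch/get_quote/GetQuote.jsp?symbol=")

def nse_quote_url(symbol):
    """Build the NSE quote URL for a given share symbol."""
    return NSE_QUOTE + symbol.upper()

if __name__ == "__main__":
    print(nse_quote_url("infy"))
```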

I went to the bseindia.com website, but unfortunately there is no direct way to construct the share price link. For example, for Infosys the link is http://www.bseindia.com/stock-share-price/infosys-ltd/infy/500209/, and for Tech Mahindra it is http://www.bseindia.com/stock-share-price/tech-mahindra-ltd/techm/532755/. So I can get infy and techm as the share symbols, but there are two extra unknowns here: for infy it is “infosys-ltd” and some number (500209), and the same with Tech Mahindra (“tech-mahindra-ltd” and 532755). I don’t know where to get this information for each share, so I couldn’t dynamically create a link for BSE. I started looking for other websites/resources to fetch the data from. There were a couple of websites (rediff, moneycontrol etc.) which gave the share prices, but their URLs needed modification in some way or another as well. Another thing I found was that if you google “Infosys share price”, you get the BSE share price of Infosys right there. Now this was easier, because I was sure Google has a Python API which I could use to query. However, after comparing the prices from Google and the BSE website, I found that they were not equal; there was some difference in the price. It is not worth the effort if you can’t return the correct prices. So I decided to scrape the prices from the BSE website only. I didn’t know how yet, but at least the prices would be reliable if we took them from the original website.

So then I had the idea of using Google’s search API to get to the BSE website. We can search for the share price on Google and get a list of results, one of which will be bseindia’s website, usually present in the top 10 search results. So I just filter the results and select the link which has bseindia in it. With that we have the URL, we can go to the respective BSE page, and then extract the price.
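The filtering step above can be sketched like this (the result list is a stand-in for whatever the search API returns):

```python
def pick_bse_link(results):
    """Return the first result URL that points at bseindia.com, else None."""
    for url in results:
        if "bseindia" in url:
            return url
    return None
```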

After we reach that page, it took a little bit of time to find out how to extract the value (share price) from it. I am using PhantomJS and BeautifulSoup to extract the data from the website. Basically, there are only two classes we need to consider to get the prices: “tbmainred” and “tbmaingreen”. If the price is greater than the previous day’s, the element takes the class tbmaingreen, and if it is lower, it takes tbmainred. So I just need to search for that class and extract its value. I did create a class so I can use it from a third-party file to get the price.
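A simplified stand-in for the class-based lookup described above (the real code uses PhantomJS and BeautifulSoup; this regex sketch only illustrates finding the tbmainred/tbmaingreen element on a toy HTML snippet):

```python
import re

def extract_price(html):
    """Return the number inside a tbmainred/tbmaingreen element, as a float."""
    m = re.search(r'class="tbmain(?:red|green)"[^>]*>([\d.,]+)<', html)
    return float(m.group(1).replace(",", "")) if m else None

if __name__ == "__main__":
    page = '<td class="tbmaingreen">1,234.55</td>'
    print(extract_price(page))  # 1234.55
```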

I added an extra file (test.py) which reads the data from a csv, extracts the prices for all the shares listed in it, and appends them as a new column. Maybe I will use this to email the prices in the future.